Preface



Field-programmable gate arrays (FPGAs) are on the verge of revolutionizing
digital signal processing in the manner that programmable digital signal
processors (PDSPs) did nearly two decades ago. Many front-end digital signal
processing (DSP) algorithms, such as FFTs and FIR or IIR filters, to name just
a few, previously built with ASICs or PDSPs, are now most often implemented
with FPGAs. Modern FPGA families provide DSP arithmetic support with
fast-carry chains (Xilinx Virtex, Altera FLEX) that are used to implement
multiply-accumulates (MACs) at high speed, with low overhead and low cost
[1]. Previous FPGA families most often targeted TTL “glue logic” and
did not have the high gate count needed for DSP functions. The efficient
implementation of these front-end algorithms is the main goal of this book.

At the beginning of the twenty-first century we find that the two
programmable logic device (PLD) market leaders (Altera and Xilinx) both
report revenues greater than US$1 billion. FPGAs have enjoyed steady growth
of more than 20% in the last decade, outperforming ASICs and PDSPs by
10%. This comes from the fact that FPGAs have many features in common
with ASICs, such as reduction in size, weight, and power dissipation,
higher throughput, better security against unauthorized copies, reduced
device and inventory cost, and reduced board test costs. FPGAs also claim
advantages over ASICs, such as a reduction in development time (rapid
prototyping), in-circuit reprogrammability, and lower NRE costs, resulting in
more economical designs for solutions requiring less than 1000 units. Compared
with PDSPs, FPGA design typically exploits parallelism, e.g., implementing
multiple multiply-accumulate cells; efficiency, e.g., zero product terms are
removed; and pipelining, i.e., each LE has a register, so pipelining
requires no additional resources.

Another trend in the DSP hardware design world is the migration from 
graphical design entries to hardware description language (HDL). Although 
many DSP algorithms can be described with “signal flow graphs,” it has been 
found that “code reuse” is much higher with HDL-based entries than with 
graphical design entries. There is a high demand for HDL design engineers 
and we already find undergraduate classes about logic design with HDLs [2]. 
Unfortunately, two HDLs are popular today. The US west coast and
Asia prefer Verilog, while the US east coast and Europe more frequently
use VHDL. For DSP with FPGAs both languages seem to be well suited,
although some VHDL examples are a little easier to read because of the
supported signed arithmetic and multiply/divide operations in the IEEE VHDL
1076-1987 and 1076-1993 standards. The gap is expected to disappear after
approval of the Verilog IEEE standard 1364-1999, as it also includes signed
arithmetic. Other constraints may include personal preferences, EDA library
and tool availability, data types, readability, capability, and language exten- 
sions using PLIs, as well as commercial, business, and marketing issues, to 
name just a few [3]. Tool providers acknowledge today that both languages 
have to be supported and this book covers examples in both design languages. 

We are now also in the fortunate situation that “baseline” HDL compilers 
are available from different sources at essentially no cost for educational use. 
We take advantage of this fact in this book. It includes a CD-ROM with
Altera’s newest MaxPlusII software, which provides a complete set of design
tools, from a context-sensitive editor, compiler, and simulator, to a bitstream
generator. All examples presented are written in VHDL and Verilog and
should be easily adapted to other proprietary design-entry systems. Xilinx’s
“Foundation Series,” ModelTech’s ModelSim compiler, and Synopsys FC2 or
FPGA Compiler should work without any changes in the VHDL or Verilog
code.

The book is structured as follows. The first chapter starts with a snapshot 
of today’s FPGA technology, and the devices and tools used to design state- 
of-the-art DSP systems. It also includes a detailed case study of a frequency 
synthesizer, including compilation steps, simulation, performance evaluation, 
power estimation, and floor planning. This case study is the basis for more 
than 30 other design examples in subsequent chapters. The second chapter 
focuses on the computer arithmetic aspects, which include possible number 
representations for DSP FPGA algorithms as well as implementation of basic 
building blocks, such as adders, multipliers, or sum-of-product computations. 
At the end of the chapter we discuss two very useful computer arithmetic con- 
cepts for FPGAs: distributed arithmetic (DA) and the CORDIC algorithm. 
Chapters 3 and 4 deal with theory and implementation of FIR and IIR filters.
We will review how to determine filter coefficients and discuss possible
implementations optimized for size or speed. Chapter 5 covers many concepts 
used in multirate digital signal processing systems, such as decimation, inter- 
polation, and filter banks. At the end of Chap. 5 we discuss the various pos- 
sibilities for implementing wavelet processors with two-channel filter banks. 
In Chap. 6, implementation of the most important DFT and FFT algorithms 
is discussed. These include Rader, chirp-z, and Goertzel DFT algorithms, as 
well as Cooley-Tukey, Good-Thomas, and Winograd FFT algorithms. In
Chap. 7 we discuss more specialized algorithms, which seem to have great 
potential for improved FPGA implementation when compared with PDSPs. 
These algorithms include number theoretic transforms, algorithms for
cryptography and error correction, and communication system implementations.



The appendix includes an overview of the VHDL and Verilog languages, the 
examples in Verilog HDL, and a short introduction to the utility programs 
included on the CD-ROM. 
Meyer-Baese.
1. Introduction



This chapter gives an overview of the algorithms and technology we will 
discuss in the book. It starts with an introduction to digital signal processing 
and we will then discuss FPGA technology in particular. Finally, the Altera 
EPF10K70 and a larger design example, including chip synthesis, timing 
analysis, floorplan, and power consumption, will be studied. 



1.1 Overview of Digital Signal Processing (DSP) 

Signal processing has been used to transform or manipulate analog or digital 
signals for a long time. One of the most frequent applications is obviously 
the filtering of a signal, which will be discussed in Chaps. 3 and 4. Digital 
signal processing has found many applications, ranging from data communi- 
cations, speech, audio or biomedical signal processing, to instrumentation and 
robotics. Table 1.1 gives an overview of applications where DSP technology 
is used [6]. 

Digital signal processing (DSP) has become a mature technology and has 
replaced traditional analog signal processing systems in many applications. 
DSP systems enjoy several advantages, such as insensitivity to change in 
temperature, aging, or component tolerance. Historically, analog chip design 
yielded smaller die sizes, but now, with the noise associated with modern 
submicrometer designs, digital designs can often be much more densely in- 
tegrated than analog designs. This yields compact, low-power, and low-cost 
digital designs. 

Two events have accelerated DSP development. One is the disclosure by
Cooley and Tukey (1965) of an efficient algorithm to compute the discrete
Fourier transform (DFT). This class of algorithms will be discussed in detail
in Chapter 6. The other milestone was the introduction of the programmable
digital signal processor (PDSP) in the late 1970s. This could compute a
(fixed-point) “multiply-and-accumulate” in only one clock cycle, which was
an essential improvement compared with the “von Neumann” microprocessor-based
systems in those days. Modern PDSPs may include more sophisticated
functions, such as floating-point multipliers, barrel shifters, memory banks, or
zero-overhead interfaces to A/D and D/A converters. EDN publishes a detailed
overview of available PDSPs every year [7]. Figure 1.1 shows a typical
Table 1.1. Digital signal processing applications.

Area                 DSP algorithm
-------------------  ----------------------------------------------------------
General purpose      Filtering and convolution, adaptive filtering, detection
                     and correlation, spectral estimation and Fourier
                     transform

Speech processing    Coding and decoding, encryption and decryption, speech
                     recognition and synthesis, speaker identification, echo
                     cancellation, cochlea-implant signal processing

Audio processing     Hi-fi encoding and decoding, noise cancellation, audio
                     equalization, ambient acoustics emulation, audio mixing
                     and editing, sound synthesis

Image processing     Compression and decompression, rotation, image
                     transmission and decompositioning, image recognition,
                     image enhancement, retina-implant signal processing

Information systems  Voice mail, facsimile (fax), modems, cellular telephones,
                     modulators/demodulators, line equalizers, data encryption
                     and decryption, digital communications and LANs,
                     spread-spectrum technology, wireless LANs, radio and
                     television, biomedical signal processing

Control              Servo control, disk control, printer control, engine
                     control, guidance and navigation, vibration control,
                     power-system monitors, robots

Instrumentation      Beamforming, waveform generation, transient analysis,
                     steady-state analysis, scientific instrumentation, radar
                     and sonar


application used to implement an analog system by means of a PDSP. We
will return to PDSPs in Sect. 1.3.1 and Chap. 2 (p. 90) after we have studied
FPGA architectures.
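
The single-cycle multiply-and-accumulate that made the PDSP attractive can be sketched in VHDL as follows. This is only a minimal illustration, not a listing from the book: the entity name mac, the generics W (operand width) and G (number of guard bits), and the synchronous clear input clr are assumptions chosen for this sketch.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Hypothetical MAC cell: acc <= acc + x*c on every rising clock edge.
ENTITY mac IS
  GENERIC (W : INTEGER := 16;   -- Operand width (assumed)
           G : INTEGER := 8);   -- Guard bits against overflow (assumed)
  PORT (clk, clr : IN  STD_LOGIC;
        x, c     : IN  SIGNED(W-1 DOWNTO 0);
        acc      : OUT SIGNED(2*W+G-1 DOWNTO 0));
END mac;

ARCHITECTURE rtl OF mac IS
  SIGNAL acc_r : SIGNED(2*W+G-1 DOWNTO 0) := (OTHERS => '0');
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      IF clr = '1' THEN
        acc_r <= (OTHERS => '0');          -- Synchronous clear
      ELSE
        -- Product has 2*W bits; sign-extend it to the accumulator width
        acc_r <= acc_r + RESIZE(x * c, acc_r'LENGTH);
      END IF;
    END IF;
  END PROCESS;
  acc <= acc_r;
END rtl;
```

The G guard bits model the extended accumulator wordwidth: up to 2^G full-scale products can be summed before the accumulator can overflow.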




Fig. 1.1. A typical DSP application. 




1.2 FPGA Technology

VLSI circuits can be classified as shown in Fig. 1.2. FPGAs are a member 
of a class of devices called field-programmable logic (FPL). FPLs are defined 
as programmable devices containing repeated fields of small logic blocks and 
elements². It can be argued that an FPGA is an ASIC technology since
FPGAs are application-specific ICs. It is, however, generally assumed that the
design of a classic ASIC requires additional semiconductor processing steps
beyond those required for an FPL. The additional steps provide higher-order 
ASICs with their performance advantage, but also with high nonrecurring 
engineering (NRE) costs. Gate arrays, on the other hand, typically consist of 
a “sea of NAND gates” whose functions are customer provided in a “wire list.” 
The wire list is used during the fabrication process to achieve the distinct 
definition of the final metal layer. The designer of a programmable gate array 
solution, however, has full control over the actual design implementation 
without the need (and delay) for any physical IC fabrication facility. 

1.2.1 Classification by Granularity 

Logic block size correlates to the granularity of a device that, in turn, relates 
to the effort required to complete the wiring between the blocks (routing 
channels). In general three different granularity classes can be found: 

• Fine granularity (Pilkington or “sea of gates” architecture) 

• Medium granularity (FPGA) 

• Large granularity (CPLD) 



Fine-Granularity Devices

Fine-grain devices were first licensed by Plessey and later by Motorola, being
supplied by Pilkington Semiconductor. The basic logic cell consisted of a
single NAND gate and a latch (see Fig. 1.3). Because it is possible to realize
any binary logic function using NAND gates (see Exercise 1.1, p. 27), NAND
gates are called universal functions. This technique is still in use for gate array
designs along with approved logic synthesis tools, such as ESPRESSO. Wiring
between gate-array NAND gates is accomplished by using additional metal
layer(s). For programmable architectures, this becomes a bottleneck because
the routing resources used are very high compared with the implemented
logic functions. In addition, a high number of NAND gates is needed to build
a simple DSP object. A fast 4-bit adder, for example, uses about 130 NAND
gates. This makes fine-granularity technologies unattractive for implementing
most DSP algorithms.
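
To illustrate the universality of the NAND gate on a small scale, the following sketch builds a two-input XOR from four NAND gates. The entity and signal names are chosen for this illustration only; the construction is the standard four-gate XOR and is not taken from the Plessey device documentation.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical example: XOR realized with NAND gates only.
ENTITY xor_from_nand IS
  PORT (a, b : IN  STD_LOGIC;
        y    : OUT STD_LOGIC);
END xor_from_nand;

ARCHITECTURE nand_only OF xor_from_nand IS
  SIGNAL n1, n2, n3 : STD_LOGIC;   -- Internal NAND outputs
BEGIN
  n1 <= a NAND b;
  n2 <= a NAND n1;
  n3 <= b NAND n1;
  y  <= n2 NAND n3;                -- y = a XOR b
END nand_only;
```

Even this single XOR costs four cells and three internal nets, which gives a feeling for why the routing effort grows so quickly in a fine-grain architecture.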

² Called configurable logic block (CLB) by Xilinx, and logic cell (LC) or logic
element (LE) by Altera.



Fig. 1.2. Classification of VLSI circuits (©1995 VDI Press [4]). 



Medium-Granularity Devices 

The most common FPGA architecture is shown in Fig. 1.4a. A concrete ex- 
ample of a contemporary medium-grain FPGA device is shown in Fig. 1.5. 
The elementary logic blocks are typically small tables (e.g., Xilinx Virtex 
with 4- to 5-bit input tables, 1- or 2-bit output), or are realized with ded- 
icated multiplexer (MPX) logic such as that used in Actel ACT-2 devices 
[9]. Routing channel choices range from short to long. A programmable I/O 
block with flip-flops is attached to the physical boundary of the device. 
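
A small table-based logic block of this kind can be modeled as a constant bit vector indexed by the inputs. The sketch below assumes a 4-input, 1-output table; the entity name lut4 and the generic INIT are hypothetical names for this illustration and do not correspond to a vendor primitive. The default contents x"6996" implement the 4-input XOR (odd parity).

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Hypothetical 4-input look-up table: bit k of INIT is the output
-- for the input combination whose binary value is k.
ENTITY lut4 IS
  GENERIC (INIT : STD_LOGIC_VECTOR(15 DOWNTO 0) := x"6996");
  PORT (i : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
        o : OUT STD_LOGIC);
END lut4;

ARCHITECTURE rtl OF lut4 IS
BEGIN
  o <= INIT(TO_INTEGER(UNSIGNED(i)));   -- Table look-up
END rtl;
```

Any Boolean function of four variables fits into the same 16-bit table, so the logic delay is independent of the function that is implemented.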

Large-Granularity Devices

Large granularity devices, such as complex programmable logic devices 
(CPLD), are characterized in Fig. 1.4b. They are defined by combining so- 
called simple programmable logic devices (SPLDs), like the classic GAL16V8 
shown in Fig. 1.6. This SPLD consists of a programmable logic array (PLA) 
implemented as an AND/OR array and a universal I/O logic block. The 
SPLDs used in CPLDs typically have 8 to 10 inputs, 3 to 4 outputs, and
support around 20 product terms. Between these SPLD blocks wide busses
(called programmable interconnect arrays (PIAs) by Altera) with short
delays are available. By combining the bus and the fixed SPLD timing, it is
possible to provide predictable and short pin-to-pin delays with CPLDs.

Fig. 1.3. Plessey ERA60100 architecture with 10K NAND logic blocks [8]. (a)
Elementary logic block. (b) Routing architecture (©1990 Plessey).



1.2.2 Classification by Technology 

FPLs are available in virtually all memory technologies: SRAM, EPROM,
E²PROM, and antifuse [10]. The specific technology defines whether the
device is reprogrammable or one-time programmable. Most SRAM devices can be
programmed by a single-bit stream that reduces the wiring requirements, but
also increases programming time (typically in the ms range). SRAM devices,
the dominant technology for FPGAs, are based on static CMOS memory
technology, and are re- and in-system programmable. They require, however,
an external “boot” device for configuration. Electrically programmable
read-only memory (EPROM) devices are usually used in a one-time CMOS
programmable mode because of the need to use ultraviolet light for erasure.
CMOS electrically erasable programmable read-only memory (E²PROM) can
be used as re- and in-system programmable. EPROM and E²PROM have the
advantage of a short setup time. Because the programming information is
not “downloaded” to the device, it is better protected against unauthorized
use. A recent innovation, based on an EPROM technology, is called “flash”
memory. These devices are usually viewed as “pagewise” in-system
reprogrammable systems with physically smaller cells, equivalent to an E²PROM
device. Finally, the important advantages and disadvantages of different
device technologies are summarized in Table 1.2.

Fig. 1.4. (a) FPGA and (b) CPLD architecture (©1995 VDI Press [4]).

1.2.3 Benchmark for FPLs 

Table 1.2. FPL technology.

Technology              SRAM          EPROM         E²PROM         Antifuse   Flash
----------------------  ------------  ------------  -------------  ---------  -----------------
Reprogrammable          yes           yes           yes            no         yes
In-system programmable  yes           no            yes            no         yes
Volatile                yes           no            no             no         no
Copy protected          no            yes           yes            yes        yes
Examples                Xilinx XC4K,  Altera MAX5K, AMD MACH,      Actel ACT  Xilinx XC9500,
                        Altera Flex   Xilinx XC7K   Altera MAX 9K             Cypress Ultra 37K

Fig. 1.5. Example of a medium-grain device (©1993 Xilinx).

Fig. 1.6. The GAL16V8. (a) First three of eight macrocells. (b) The output logic
macrocell (OLMC) (©1997 Lattice).

Providing objective benchmarks for FPL devices is a nontrivial task.
Performance is often predicated on the experience and skills of the designer,
along with design tool features. To establish valid benchmarks, the Programmable
Electronic Performance Cooperative (PREP) was founded by Xilinx [11],
Altera [12], and Actel [13], and has since expanded to more than 10 members.
PREP has developed nine different benchmarks for FPLs that are summarized
in Table 1.3. The central idea underlying the benchmarks is that each
vendor uses its own devices and software tools to implement the basic blocks
as many times as possible in the specified device, while attempting to
maximize speed. The number of instantiations of the same logic block within
one device is called the repetition rate and is the basis for all benchmarks.
For DSP comparisons, benchmarks five and six of Table 1.3 are relevant.
In Fig. 1.7, repetition rates are reported over frequency, for typical Actel
(A/e), Altera (o/e), and Xilinx (x*) devices. It can be concluded that modern
FPGA families provide the best DSP complexity and maximum speed. This
is attributed to the fact that modern devices provide fast-carry logic (see
Sect. 1.4.1, p. 16) with delays (less than 0.5 ns per bit) that allow fast adders
with large bit width, without the need for expensive “carry look-ahead”
decoders. Although PREP benchmarks are useful to compare equivalent gate
counts and maximum speeds, for concrete applications additional attributes
are also important. They include:

• Array multiplier (e.g., 18 × 18 bits)

• Embedded hardwired microprocessor (e.g., 32-bit RISC PowerPC)

• On-chip RAM or ROM (LC or large block size)

• External memory support for ZBT, DDR, QDR, SDRAM

• Pin-to-pin delay

• Internal tristate bus

• Readback or boundary-scan decoder

• Programmable slew rate or voltage of I/O

• Power dissipation

• Ultra-high speed serial interfaces

Table 1.3. The PREP benchmarks for FPLs.

Number  Benchmark name       Description
------  -------------------  ------------------------------------------
1       Data path            Eight 4-to-1 multiplexers drive a
                             parallel-load 8-bit shift register
2       Timer/counter        Two 8-bit values are clocked through
                             8-bit value registers and compared
3       Small state machine  An 8-state machine with 8 inputs and
                             8 outputs
4       Large state machine  A 16-state machine with 40 transitions,
                             8 inputs, and 8 outputs
5       Arithmetic circuit   A 4-by-4 unsigned multiplier and 8-bit
                             accumulator
6       16-bit accumulator   A 16-bit accumulator
7       16-bit counter       Loadable binary up counter
8       16-bit synchronous   Loadable binary counter with
        prescaled counter    asynchronous reset
9       Memory mapper        The map decodes a 16-bit address space
                             into 8 ranges

Some of these features are (depending on the specific application) more
relevant to DSP applications than others. We summarize the availability of
some of these key features in Table 1.4. The first column shows the vendor,
followed by the device family name. Columns 3 to 8 show the features relevant
for most DSP applications: (3) the support of fast-carry logic for adders
or subtractors, (4) the embedded array multiplier of 18 × 18 bit width, (5) the
on-chip RAM implemented with the LCs, (6) the on-chip large memory block
of size larger than 1 kbit, (7) the embedded microprocessor: IBM’s PowerPC on
Xilinx or the ARM processor available with Altera devices, and (8) the target
price and availability of the device family. Discontinued families are marked
with d. Low-cost devices have a single $, medium price range devices have two
$$, and high price range devices have three $$$.

Table 1.4. Altera and Xilinx FPGA family DSP features.

                        Fast adder   Emb. mult.  LC    Large      Emb.  Low cost/
Vendor  Family          carry logic  18×18 bits  RAM   block RAM  µP    discontinued
------  --------------  -----------  ----------  ----  ---------  ----  ------------
Xilinx  XC2000          -            -           -     -          -     d
Xilinx  XC3000          -            -           -     -          -     d
Xilinx  XC4000          yes          -           yes   -          -     d
Xilinx  Spartan-XL      yes          -           yes   -          -     $
Xilinx  Spartan II      yes          -           yes   yes        -     $
Xilinx  Spartan III     yes          yes         yes   yes        -     $
Xilinx  Virtex          yes          -           yes   yes        -     $$
Xilinx  Virtex II       yes          yes         yes   yes        -     $$
Xilinx  Virtex II Pro   yes          yes         yes   yes        yes   $$$
Altera  FLEX8K          yes          -           -     -          -     d
Altera  FLEX10K         yes          -           -     yes        -     $$
Altera  APEX20K         yes          -           -     yes        -     $$
Altera  APEX II         yes          -           -     yes        -     $$
Altera  ACEX            yes          -           -     yes        -     $
Altera  Cyclone         yes          -           -     yes        -     $
Altera  Stratix         yes          yes         -     yes        -     $$$
Altera  Mercury         yes          -           -     yes        -     $$
Altera  Excalibur       yes          -           -     yes        yes   $$$

Figure 1.8 summarizes the power dissipation of some typical FPL devices.
It can be seen that CPLDs (Altera) usually have higher “standby” power
consumption. For higher-frequency applications, FPGAs (Xilinx and Actel)
can be expected to have a higher power dissipation. A detailed power analysis
example can be found in Sect. 1.4.2, p. 21.



1.3 DSP Technology Requirements 

The PLD market share, by vendor, is presented in Fig. 1.9. PLDs, since
their introduction in the early 1980s, have enjoyed in the last decade steady
growth of 20% per annum, outperforming ASIC growth by more than 10%.
Since 2001 the worldwide recession in microelectronics has reduced ASIC
and FPLD growth substantially. The reason that FPLDs outperformed ASICs
seems to be related to the fact that FPLs can offer many of the advantages
of ASICs such as:

• Reduction in size, weight, and power dissipation

• Higher throughput

• Better security against unauthorized copies

• Reduced device and inventory cost

• Reduced board test costs

and, in addition, offer advantages over ASICs such as:

• A reduction in development time (rapid prototyping) by a factor of three
to four

• In-circuit reprogrammability

• Lower NRE costs resulting in more economical designs for solutions
requiring less than 1000 units

CBIC ASICs are used in high-end, high-volume applications (more than
1000 copies). Compared to FPLs, CBIC ASICs typically have about ten times
more gates for the same die size. An attempt to solve the second problem is
the so-called hard-wired FPGA, where a gate array is used to implement a
verified FPGA design.

Fig. 1.7. Benchmarks for FPLs (©1995 VDI Press [4]).

1.3.1 FPGA and Programmable Signal Processors 

General-purpose programmable digital signal processors (PDSPs) [6, 14, 15]
have enjoyed tremendous success for the last two decades. They are based
on a reduced instruction set computer (RISC) paradigm with an architecture
consisting of at least one fast array multiplier (e.g., 16 × 16-bit to 24 × 24-bit
fixed-point, or 32-bit floating-point), with an extended wordwidth accumulator.
The PDSP advantage comes from the fact that most signal processing
algorithms are multiply and accumulate (MAC) intensive. By using a multistage
pipeline architecture, PDSPs can achieve MAC rates limited only by
the speed of the array multiplier. It can be argued that an FPGA can also
be used to implement MAC cells [16], but cost issues will most often give
PDSPs an advantage, if the PDSP meets the desired MAC rate. On the
other hand we now find many high-bandwidth signal-processing applications
such as wireless, multimedia, or satellite transmission, and FPGA technology
can provide more bandwidth through multiple MAC cells on one chip.
In addition, there are several algorithms such as CORDIC, NTT, or
error-correction algorithms, which will be discussed later, where FPL technology
has been proven to be more efficient than a PDSP. It is assumed [17] that in
the future PDSPs will dominate applications that require complicated
algorithms (e.g., several if-then-else constructs), while FPGAs will dominate
more front-end (sensor) applications like FIR filters, CORDIC algorithms, or
FFTs, which will be the focus of this book.

Fig. 1.8. Power dissipation for FPLs (©1995 VDI Press [4]).



1.4 Design Implementation 

The levels of detail commonly used in VLSI designs range from a geometrical
layout of full custom ASICs to system design using so-called set-top
boxes. Table 1.5 gives a survey. Layout and circuit-level activities are absent
from FPGA design efforts because their physical structure is programmable
but fixed. The best utilization of a device is typically achieved at the gate
level using register transfer design languages. Time-to-market requirements,
combined with the rapidly increasing complexity of FPGAs, are forcing a
methodology shift towards the use of “intellectual property” (IP) macrocells
or “mega-core cells.” Macrocells provide the designer with a collection of
predefined functions, such as microprocessors or UARTs. The designer, therefore,
need only specify selected features and attributes (e.g., accuracy), and a
“synthesizer” will generate a hardware description code or schematic for the
resulting solution.

Table 1.5. VLSI design levels.

Object    Objectives                  Example
--------  --------------------------  ----------------------------------
System    Performance specifications  Computer, disk unit, radar
Chip      Algorithm                   µP, RAM, ROM, UART, parallel port
Register  Data flow                   Register, ALU, COUNTER, MUX
Gate      Boolean equations           AND, OR, XOR, FF
Circuit   Differential equations      Transistor, R, L, C
Layout    None                        Geometrical shapes

Fig. 1.9. Revenues of the top five vendors in the PLD/FPGA/CPLD market.

A key point in FPGA technology is, therefore, powerful design tools to 

• Shorten the design cycle 

• Provide good utilization of the device 

• Provide synthesizer options, i.e., choose between optimizing for speed versus
size of the design

Fig. 1.10. CAD design circle.



A CAE tool taxonomy, as it applies to FPGA design flow, is presented 
in Fig. 1.10. In general, the decision whether to work within a graphical or a 
text design environment is a matter of personal taste and prior experience. 
A graphical presentation of a DSP solution can emphasize the highly regular 
dataflow associated with many DSP algorithms. The textual environment, 
however, is often preferred with regard to algorithm control design and al- 
lows a wider range of design styles as demonstrated in the following design 
example. Specifically, for Altera’s MaxPlusII, it seemed that with text de- 
sign more special attributes and more precise behavior can be assigned in the 
designs. 

Example 1.1: Comparison of VHDL Design Styles 

The following design example illustrates three design strategies in a VHDL 

context. Specifically, the techniques explored are: 

• Structural style (component instantiation, i.e., graphical netlist design)

• Behavioral style
  - Data flow
  - Sequential design using PROCESS templates
The VHDL design file example.vhd⁴ follows (comments start with --):

    PACKAGE eight_bit_int IS           -- User-defined type
      SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
    END eight_bit_int;

    LIBRARY work;
    USE work.eight_bit_int.ALL;

    LIBRARY lpm;                       -- Using predefined packages
    USE lpm.lpm_components.ALL;

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;

    ENTITY example IS                  -- Interface
      GENERIC (WIDTH : INTEGER := 8);  -- Bit width
      PORT (clk  : IN  STD_LOGIC;
            a, b : IN  BYTE;
            op1  : IN  STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
            sum  : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
            d    : OUT BYTE);
    END example;

    ARCHITECTURE flex OF example IS
      SIGNAL c, s     : BYTE;          -- Auxiliary variables
      SIGNAL op2, op3 : STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
    BEGIN

      -- Conversion int -> logic vector
      op2 <= CONV_STD_LOGIC_VECTOR(b, 8);

      add1: lpm_add_sub                -- Component instantiation
        GENERIC MAP (LPM_WIDTH => WIDTH,
                     LPM_REPRESENTATION => "SIGNED",
                     LPM_DIRECTION => "ADD")
        PORT MAP (dataa  => op1,
                  datab  => op2,
                  result => op3);

      reg1: lpm_ff
        GENERIC MAP (LPM_WIDTH => WIDTH)
        PORT MAP (data  => op3,
                  q     => sum,
                  clock => clk);

      c <= a + b;                      -- Data flow style

      p1: PROCESS                      -- Behavioral style
      BEGIN
        WAIT UNTIL clk = '1';
        s <= c + s;                    -- Signal assignment statement
      END PROCESS p1;
      d <= s;

    END flex;

⁴ The equivalent Verilog code example.v for this example can be found in
Appendix A on page 435.



After a successful functional (only) simulation of the design (for the
MaxPlusII compiler mode select the option Processing → Functional SNF
Extractor) we can proceed and start with the design implementation as
reported in Fig. 1.10. To do this with the MaxPlusII compiler, we choose
Processing → Timing SNF Extractor, and we will then see that the compiler
window now has three more entries, namely Logic Synthesizer,
Fitter, and Timing SNF Extractor. After starting the compiler we can
then conduct a simulation with timing, check for glitches, or measure the
Registered Performance of the design, to name just a few options. After
all these steps are successful, and if a hardware board (like the Altera
University board) is available, we proceed with programming the device and
may perform additional hardware tests using the “read back” methods, as
reported in Fig. 1.10.
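
The structural part of Example 1.1 relies on the lpm_add_sub and lpm_ff components from the LPM library. When that library is not available in a given tool, the same registered adder can be described behaviorally. The following is a minimal sketch under that assumption; the entity name add_reg is hypothetical, and the IEEE numeric_std package is used here in place of std_logic_arith.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Hypothetical portable equivalent of the lpm_add_sub + lpm_ff pair:
-- a signed adder followed by a register.
ENTITY add_reg IS
  GENERIC (WIDTH : INTEGER := 8);   -- Bit width
  PORT (clk : IN  STD_LOGIC;
        op1 : IN  STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
        op2 : IN  STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
        sum : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0));
END add_reg;

ARCHITECTURE rtl OF add_reg IS
BEGIN
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      -- Signed addition; the result wraps around on overflow
      sum <= STD_LOGIC_VECTOR(SIGNED(op1) + SIGNED(op2));
    END IF;
  END PROCESS;
END rtl;
```

Whether a synthesis tool maps this description onto the same fast-carry resources as the LPM instantiation depends on the tool and its settings.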

1.4.1 FPGA Structure 

At the beginning of the twenty-first century two FPGA device families seemed 
to have the most attractive features for implementing DSP algorithms, due to 
the fact that these families provide fast-carry logic, which allows implementa- 
tions of 32-bit (nonpipelined) adders at speeds exceeding 50 MHz [1, 18, 19]. 
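
As a rough plausibility check of this figure, one can take the carry delay bound of 0.5 ns per bit quoted in Sect. 1.2.3 and consider only the ripple through the carry chain. This is a simplifying assumption: LUT, routing, and register setup delays are ignored, so the result is an optimistic bound rather than a measured value.

```latex
t_{\mathrm{carry}} < 32 \times 0.5\,\mathrm{ns} = 16\,\mathrm{ns}
\quad\Longrightarrow\quad
\frac{1}{t_{\mathrm{carry}}} > \frac{1}{16\,\mathrm{ns}} = 62.5\,\mathrm{MHz}
```

A 50 MHz clock corresponds to a period of 20 ns, so under this estimate about 4 ns remain for all other delays in the register-to-register path.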

These two families are the Xilinx XC4000 family (and the newest derivatives,
e.g., Spartan and Virtex) and the Altera FLEX 10K devices (and the
newest derivatives, e.g., APEX, ACEX, Mercury, Stratix, and Excalibur),
which are Altera’s 8K devices with additional 2-kbit RAM blocks called
embedded array blocks (EABs). The Xilinx devices have the wide range of
routing levels typical in FPGAs, while the Altera devices were based on the
architecture with wide busses used in Altera’s CPLDs. But the basic blocks
of the FLEX 10K are no longer large PLAs as in CPLDs. Instead the devices
now have medium granularity, i.e., small look-up tables (LUTs), as is typical
for FPGAs.

The basic logic elements of the Xilinx XC4000 family are called configurable
logic blocks (CLBs) and have two separate 4-input 1-output LUTs,
fast-carry logic, one additional 3-input 1-output LUT to combine the two
separate LUTs, and two flip-flops, as shown in Fig. 1.11. The Xilinx device has
five levels of routing, ranging from CLB to CLB, to long lines spanning the
entire chip. Each CLB can be used as 16 × 2- or 32 × 1-bit RAM or ROM.
Table 1.6 shows some members of the Xilinx XC4000 family.

Table 1.6. The Xilinx XC4000 family.

Device   Total CLBs  Flip-flops  Max. RAM (kbits)  Max. I/O
-------  ----------  ----------  ----------------  --------
XC4003   100         360         3.2               80
XC4005   196         616         6.3               112
XC4010   400         1120        12.8              160
XC4025   1024        2560        32                256
XC4085   3136        7168        100               448
XC40150  5184        11520       165               448
XC40250  8464        18400       270               448

Fig. 1.11. XC4000 logic cell (©1993 Xilinx).

The basic block of the Altera FLEX 10K device achieves a medium granularity
using small LUTs. The 10K device is an Altera 8K device with added 2 kbit
RAM blocks, called embedded array blocks (EABs). The basic logic element in
Altera FLEX 10K devices is called a logic element (LE)^5 and consists of a
flip-flop and either a 4-input 1-output LUT, a 3-input 1-output LUT with
fast-carry logic, or AND/OR product term expanders, as shown in Fig. 1.12.
Eight LEs are combined in a logic array block (LAB). Each row contains an
embedded array block (EAB; i.e., a 2-kbit RAM or ROM) that can be configured
as a 256 x 8, 512 x 4, 1024 x 2, or 2048 x 1 memory device. These EABs and
LABs are connected through wide high-speed busses with 100 to 300 lines per
column as shown in Fig. 1.13. Table 1.7 shows some members of the Altera
FLEX 10K family.

^5 Sometimes also called logic cells (LCs) in a "design report file." See
example.rpt.

Fig. 1.12. FLEX logic cell (©1996 Altera).



If we compare the two routing strategies from Altera and Xilinx we find 
that both approaches have value: the Xilinx approach with more local and 
less global routing resources is synergistic to DSP use because most digital 
signal processing algorithms process the data locally. The Altera approach, 
with wide busses, also has value, because typically not only are single bits 
processed in “bit slice” operations, but normally wide data vectors with 16 
to 32 bits must be moved to the next DSP block. 



Table 1.7. The FLEX 10K family.

Device       Total      Flip-flop   EABs   Max. RAM   Max.
             logic      bits               kbits      I/O
             elements
EPF10K10     576        720         3      6          134
EPF10K20     1152       1344        6      12         189
EPF10K30     1728       1968        6      12         246
EPF10K40     2304       2576        8      16         189
EPF10K50     2880       3184        10     20         310
EPF10K70     3744       4096        9      18         358
EPF10K100    4992       5392        12     24         406
EPF10K130    6656       7120        16     32         470
EPF10K250    12160      12624       20     40         470

Fig. 1.13. Overall bus structure in FLEX 10K devices (©1996 Altera).

1.4.2 The Altera EPF10K70RC240-4 

The Altera EPF10K70RC240-4 device, which is part of the UP2 demo board 
provided through Altera’s University Program, is used throughout this book. 
The device nomenclature is interpreted as follows: 

EPF10K70RC240-4
|     | |     |
|     | |     +--> 4 ns device
|     | +--------> Package and pin number
|     +----------> Equivalent gate count
+----------------> Device family

Specific design examples will, wherever possible, target Altera devices
using Altera-supplied software. The enclosed MaxPlusII software is a fully
integrated system with VHDL and Verilog editor, synthesizer, simulator, and
bitstream generator. Because all examples are available in VHDL and Verilog,
any other simulator may also be used. For instance, the device-independent
Synopsys FC2 or ModelTech compiler has successfully been used to compile
the examples using the synthesizable code for lpm functions on the CD-ROM
provided by EDIF.








Logic Resources 

The EPF10K70 is a member of the Altera 10K family and has a gate com- 
plexity equivalent to about 70 000 two-input NAND gates. The maximum 
number of full adders that can be implemented may, however, be a more 
useful metric for DSP applications. From Table 1.7, it can be seen that the 
EPF10K70 device has 3744 basic logic elements (LEs). This is also the max- 
imum number of implementable full adders. Each LE can be used as a four- 
input LUT, or in the “arithmetic” mode, as a three-input LUT with an ad- 
ditional fast carry as shown in Fig. 1.12. Eight LEs are always combined into 
a logic array block (LAB). The number of LABs is therefore 3744/8=468. 
These 468 LABs are arranged in nine rows and 52 columns. The device also 
includes one 2-kbit memory block (called an embedded array block, or EAB) 
in the center of each row. The EPF10K70 has therefore nine EABs, or a total 
of 18 kbits of memory. Figure 1.13 presents part of the device floorplan. 
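
The resource figures above follow from simple arithmetic, which the short
Python sketch below reproduces. The constants are the ones quoted in the text
and in Table 1.7; the count of 16-bit adders is only an illustration and
assumes one LE per full adder with no routing or placement losses.

```python
# Resource arithmetic for the EPF10K70, using the figures quoted in the text.
LES_TOTAL = 3744        # logic elements (Table 1.7)
LES_PER_LAB = 8         # eight LEs form one logic array block
ROWS, COLUMNS = 9, 52   # LAB array dimensions
EAB_KBITS = 2           # one 2-kbit EAB per row

labs = LES_TOTAL // LES_PER_LAB
eab_kbits_total = ROWS * EAB_KBITS

# Illustrative assumption: one LE per full adder, nothing lost to routing.
max_16bit_adders = LES_TOTAL // 16

print("LABs:", labs, "(rows x columns =", ROWS * COLUMNS, ")")
print("EAB memory:", eab_kbits_total, "kbits")
print("Upper bound on 16-bit adders:", max_16bit_adders)
```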

Routing Resources 

Each LAB has 26 inputs from each row and eight signals coming from the 
logic elements. There are four additional LAB control signals (e.g., preset 
of registers) and two local carry and cascade interconnects. To connect the 
LABs, the EPF10K70 uses fast, wide row and column busses, called “fast 
track interconnects.” Each row bus is 312 lines wide with 24 channels per 
column. For improved routability, Altera has divided the row interconnect 
into full-length (104 channels) and half-length channels (2 x 104 = 208 chan- 
nels) for a total of 3 x 104 = 312 channels. The half-length channels end 
toward the middle of the channel where the EABs are located. The EABs 
can access both half-length channels. It is also interesting to note that the 
long carry chains skip alternate rows, so that only each second LAB occupies 
the same carry chain (see Fig. 1.17, p. 25). 

Timing Estimates 

Altera’s MaxPlusII software calculates various timing data, such as the Delay 
Matrix, Registered Performance, and Setup/Hold Matrix. For a full de- 
scription of all timing parameters, refer to Altera’s web-page [19]. To achieve 
optimal performance, it is necessary to understand how the software physi- 
cally implements the design. It is useful, therefore, to produce a rough esti- 
mate of the solution and then determine how the design may be improved. 

Example 1.2: Speed of a 16-bit Adder

Assume one is required to implement a 16-bit adder and estimate the design's
maximum speed. The adder can be implemented in two LABs, each using the
fast-carry chain. The "same row" routing delay must be taken into account.
The total delay is computed as follows: First, the two inputs must be stable
($t_{co}$). Next, the first carry must be generated ($t_{cgen}$), followed by
seven more carries inside the first LAB. The signal then goes through the
row interconnect ($t_{samerow}$). Inside the second LAB, seven additional
carries must be computed and the MSB then must run through an LUT to complete
the sum. The results are then stored in the LE register. The following table
summarizes these timing data:

LE register clock-to-output delay    $t_{co}$               = 1.4 ns
Data-in to carry-out delay           $t_{cgen}$             = 1.4 ns
Carry-in to carry-out delay          $7 \times t_{cico}$ = 7 x 0.3 = 2.1 ns
Row routing delay                    $t_{samerow}$          = 5.5 ns
Carry-in to carry-out delay          $7 \times t_{cico}$ = 7 x 0.3 = 2.1 ns
LE look-up table delay               $t_{LUT}$              = 2.0 ns
LE register setup time               $t_{su}$               = 2.6 ns
Total                                                       = 17.1 ns

The estimated delay is 17.1 ns, or a rate of 58.5 MHz. The design is expected
to use about 16 LEs (see also Exercise 1.7, p. 28). | 1.2 |



If the two LABs used cannot be placed in the same row then the same-column
delay $t_{samecolumn}$ = 3.7 ns applies (instead of $t_{samerow}$). The worst
case occurs if the two LABs used are placed in different rows. The worst
case delay becomes $t_{tworows}$ = 14.7 ns. It is therefore very important to
check the floorplan and look for possible improvements through "by hand"
changes in the floorplan, as described in the Altera "Getting Started"
manual, pages 231-241 [20], or see literature/manual/81_gs.pdf on the CD-ROM.
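
The estimate of Example 1.2 is easy to repeat for the other placements. The
following Python sketch sums the delay components from the table of
Example 1.2 and swaps in the alternative routing delays quoted above. The
function names are ours, and the sketch assumes that only the routing term
changes between placements.

```python
# Delay components in ns, copied from the table in Example 1.2.
T_CO, T_CGEN, T_CICO, T_LUT, T_SU = 1.4, 1.4, 0.3, 2.0, 2.6

# Routing delays in ns between the two LABs, as quoted in the text.
ROUTING = {"same row": 5.5, "same column": 3.7, "two rows": 14.7}

def adder16_delay_ns(t_route):
    """Critical path of a 16-bit adder split across two LABs."""
    return T_CO + T_CGEN + 7 * T_CICO + t_route + 7 * T_CICO + T_LUT + T_SU

def to_mhz(delay_ns):
    """Convert a register-to-register delay in ns to a clock rate in MHz."""
    return 1000.0 / delay_ns

for name, t_route in ROUTING.items():
    d = adder16_delay_ns(t_route)
    print(f"{name:12s}: {d:5.1f} ns -> {to_mhz(d):5.1f} MHz")
```

For the same-row placement the sketch reproduces the 17.1 ns and 58.5 MHz of
Example 1.2.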

Power Dissipation 

The power consumption of an FPGA can be a critical design constraint, 
especially for mobile applications. Using 3.3 V or even lower voltage class 
devices is recommended in this case. To estimate the power dissipation of the 
Altera device EPF10K70RC240-4, three main sources must be considered, 
namely: 

1) Standby power dissipation $I_{standby} \approx 0.5$ mA

2) I/O power dissipation $I_{I/O}$

3) Active power dissipation $I_{active}$

The first two are not design dependent, and also the standby power in CMOS 
technology is generally small. The active current depends mainly on the clock 
frequency and the number of LEs in use. Altera provides the following
empirical formula to estimate the active power dissipation:

$$P = P_{intern} + P_{I/O} = (I_{standby} + I_{active}) \times V_{CC} + P_{I/O}$$

$$P \approx I_{active} V_{CC} = 85 \, \frac{\text{mA}}{\text{GHz} \times \text{LE}} \times f_{max} \times N \times \tau_{LE} \times V_{CC},$$

where $f_{max}$ is the maximum operating frequency in MHz, $N$ is the total
number of logic cells used in the device, and $\tau_{LE}$ is the average
percent of logic cells toggling at each clock (typically 12%). If, for
instance, a design uses all LEs of the EPF10K70RC240-4 and the maximum
frequency is 25 MHz, then the current will be estimated at 954 mA.
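
The numerical example can be checked with a few lines of Python. The sketch
below evaluates the empirical formula with the coefficient 85 mA/(GHz x LE)
and converts the frequency from MHz to GHz. The supply voltage of 5 V used
for the power figure is our assumption for illustration; the text gives only
the current.

```python
K_MA_PER_GHZ_LE = 85.0   # empirical coefficient from the formula above

def active_current_ma(f_max_mhz, n_le, toggle):
    """Active current in mA; f_max is converted from MHz to GHz."""
    return K_MA_PER_GHZ_LE * (f_max_mhz / 1000.0) * n_le * toggle

def active_power_mw(f_max_mhz, n_le, toggle, vcc):
    """Active power in mW for a supply voltage vcc in volts."""
    return active_current_ma(f_max_mhz, n_le, toggle) * vcc

# All 3744 LEs of the EPF10K70 at 25 MHz with 12% toggling.
i_ma = active_current_ma(25.0, 3744, 0.12)
print(f"I_active = {i_ma:.1f} mA")

# V_CC = 5 V is an assumed value, used only to illustrate the power formula.
print(f"P_active = {active_power_mw(25.0, 3744, 0.12, 5.0):.0f} mW")
```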

The following case study should be used as a detailed scheme for the 
examples and self-study problems in subsequent chapters. 



1.4.3 Case Study: Frequency Synthesizer 

The design objective in the following case study is to implement a classical 
frequency synthesizer based on the Philips PM5190 model (circa 1979, see 
Fig. 1.14). The synthesizer consists of a 32-bit accumulator, with the eight 
most significant bits (MSBs) wired to a SIN-ROM lookup table (LUT) to 
produce the desired output waveform. A graphical solution, using Altera's
MaxPlusII software, is shown in Fig. 1.15, and can be found on the CD-ROM
as book2e/vhdl/fun_graf.gdf. The VHDL text file given below implements
the design using "component instantiation." The case study consists of the
following steps:

1) Compilation of the design

2) Design results and floor plan

3) Simulation of the design, and

4) A performance evaluation

Fig. 1.14. PM5190 frequency synthesizer.

Fig. 1.15. Graphical design of frequency synthesizer.



Design Compilation 

To check and compile the file, start the MaxPlusII Software and select
File → Open to load fun_text.vhd. Notice that the top and left menus have
changed. The VHDL design^6 reads as follows:

-- A 32 bit function generator using accumulator and ROM
LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY fun_text IS
  GENERIC ( WIDTH    : INTEGER := 32);    -- Bit width
  PORT ( M           : IN  STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
         sin, acc    : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);
         clk         : IN  STD_LOGIC);
END fun_text;

ARCHITECTURE fun_gen OF fun_text IS

  SIGNAL s, acc32 : STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
  SIGNAL msbs     : STD_LOGIC_VECTOR(7 DOWNTO 0);
                                          -- Auxiliary vectors
BEGIN

  add1: lpm_add_sub                       -- Add M to acc32
    GENERIC MAP ( LPM_WIDTH => WIDTH,
                  LPM_REPRESENTATION => "SIGNED",
                  LPM_DIRECTION => "ADD",
                  LPM_PIPELINE => 0)
    PORT MAP ( dataa => M,
               datab => acc32,
               result => s );

  reg1: lpm_ff                            -- Save accu
    GENERIC MAP ( LPM_WIDTH => WIDTH)
    PORT MAP ( data => s,
               q => acc32,
               clock => clk);

  select1: PROCESS (acc32)
    VARIABLE i : INTEGER;
  BEGIN
    FOR i IN 7 DOWNTO 0 LOOP
      msbs(i) <= acc32(31-7+i);
    END LOOP;
  END PROCESS select1;

  acc <= msbs;

  rom1: lpm_rom
    GENERIC MAP ( LPM_WIDTH => 8,
                  LPM_WIDTHAD => 8,
                  LPM_FILE => "sine.mif")
    PORT MAP ( address => msbs,
               inclock => clk,
               outclock => clk,
               q => sin);

END fun_gen;

^6 The equivalent Verilog code fun_text.v for this example can be found in
Appendix A on page 436.



The object LIBRARY, found early in the code, contains predefined modules
and definitions. The ENTITY block specifies I/O ports of the device and
generic variables. Using component instantiation, three blocks (see labels
add1, reg1, rom1) are called like subroutines. The "select1" PROCESS
construct is used to select the eight MSBs to address the ROM. To set the
project to the current file, select File → Project → Set Project to Current
File. To optimize the design for speed, choose the menu Assign → Global
Project Logic Synthesis, select the option Optimize 10 (Speed), and set
Global Project Synthesis Style to FAST. Set the device type to FLEX10K70
by selecting, in the menu Assign → Device, the option FLEX10K for Device
Family. For Devices we select EPF10K70RC240-4. In order to be able to
select the speed grade 4 ns it may be necessary to deselect the option Show
Only Fastest Speed Grades, depending on the available devices. Next, start
the syntax checker with <Ctrl+K> or by selecting File → Project → Save &
Check. The compiler checks for basic syntax errors and produces the netlist
file fun_text.cnf. After the syntax check is successful, compilation can be
started by pressing the START button in the compiler window or selecting
File → Project → Save & Compile. If all compiler steps were successfully
completed, the design is fully implemented. Figure 1.16 summarizes all the
processing steps of the compilation as shown in the MaxPlusII compiler
window.

Fig. 1.16. Compilation steps in MaxPlusII.



Floor Planning

The design results can be verified by opening File → Open → fun_text.rpt,
or by double-clicking on the "rpt" button found in the compiler window (see
Fig. 1.16). Under Utilities → Find Text → LCs, find in "device summary"
the number of LCs and memory blocks used. In the report file, find the pin-out
of the device and the result of the logic synthesis (i.e., the logic equations).

Check the memory initialization file sine.mif, containing the sine table in
offset binary form. This file was generated using the program sine.exe
included on the CD-ROM under book2e/util. Select MaxPlusII → Floorplan
Editor to view the physical implementation. Use the "reduce scale" button
to produce the screen shown in Fig. 1.17. Notice that the accumulator uses
fast-carry chains, and that only every second column has been used for the
improved routing as explained in Sect. 1.4.2, p. 20.

Fig. 1.17. Floorplan of frequency synthesizer design.

Fig. 1.18. VHDL simulation of frequency synthesizer design.

Simulation 

To simulate, open the prepared waveform file with File → Open →
fun_text.scf. Notice that the top and left menu lines have changed. Set the
end time from the menu File → End Time to 1 µs. In the fun_text.scf window,
click on the clk symbol and set (left menu buttons) the Clock Period to
25 ns in the Overwrite Clock window. Set $M = 715\,827\,883$
($M = 2^{32}/6$), so that the period of the synthesizer is 6 clock cycles
long. Start the simulation by selecting MaxPlusII → Simulator and press the
start button. The simulation should give an output similar to Fig. 1.18.
Notice that the ROM has been coded in binary offset (i.e., zero = 128). When
complete, change the frequency so that a period of 8 cycles occurs, i.e.,
$M = 2^{32}/8$, and repeat the simulation.
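
Before running the MaxPlusII simulation it can help to know what to expect.
The following Python sketch is a behavioral model of the accumulator and ROM
addressing, not a replacement for the VHDL simulation. The ROM contents are
an assumption (one sine period of 256 entries, amplitude 127, offset 128),
since the actual sine.mif is not reproduced here, and the model ignores the
register delays of the clocked ROM.

```python
import math

WIDTH = 32
MASK = (1 << WIDTH) - 1

# Assumed ROM contents: one sine period, 256 entries, offset binary (zero = 128).
SINE_ROM = [128 + round(127 * math.sin(2 * math.pi * k / 256)) for k in range(256)]

def synthesize(m, cycles):
    """Return a list of (msbs, rom_value) pairs, one per clock cycle."""
    acc = 0
    out = []
    for _ in range(cycles):
        msbs = acc >> (WIDTH - 8)          # eight MSBs address the ROM
        out.append((msbs, SINE_ROM[msbs]))
        acc = (acc + m) & MASK             # 32-bit accumulator wraps modulo 2**32
    return out

M = 715827883                              # M = 2**32 / 6, rounded
for msbs, value in synthesize(M, 12):
    print(f"address = {msbs:3d}   output = {value:3d}")
```

With this value of M the ROM address repeats every six clock cycles; with
M = 2**32 // 8 it repeats every eight cycles.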

Performance Analysis 

To initiate a performance analysis, enter the MaxPlusII → Timing Analyzer.
Note that the menu line has again changed. Select Analysis → Registered
Performance and the appropriate Registered Performance screen will appear.
Click on the Start button to measure the registered performance. The
result should be similar to that shown in Fig. 1.19.

This concludes the case study of the frequency synthesizer.

Fig. 1.19. Register performance of frequency synthesizer design.



Exercises 

1.1: Use only two-input NAND gates to implement a full adder:

(a) $s = a \oplus b \oplus c_{in}$
(Note: $\oplus$ = XOR)

(b) $c_{out} = a \times b + c_{in} \times (a + b)$
(Note: + = OR; × = AND)

(c) Show that the two-input NAND is universal by implementing NOT, AND, and
OR with NAND gates.

(d) Repeat (a)-(c) for the two-input NOR gate.

(e) Repeat (a)-(c) for the two-input multiplexer $f = xs' + ys$.



Exercises Using MaxPlusII 

1.2: (a) Compile the HDL file example using the MaxPlusII compiler (see p. 14)
in the functional mode. Select as compiler option Processing → Functional SNF
Extractor.

(b) Simulate the design using the file example.scf.

Note: If you have no prior experience with the MaxPlusII software, refer to the
case study found in Sect. 1.4.3, p. 22.

(c) Compile the HDL file example using the MaxPlusII compiler with timing
extraction. Select as compiler option Processing → Timing SNF Extractor.

(d) Simulate the design using the file example.scf.

(e) Turn on the option Check Outputs in the simulator window and compare the
functional and implemented SNF.

1.3: (a) Generate a waveform file for clk, a, b, op1 that approximates that
shown in Fig. 1.20.

(b) Conduct a simulation using the HDL code example.

(c) Explain the algebraic relation between a, b, op1 and sum, d.

1.4: (a) Compile the HDL file fun_text with the synthesis style (Assign →
Global Project Logic Synthesis) Fast and Normal.

(b) Evaluate Registered Performance and the LC utilization of the two designs
from (a). Explain the results.

1.5: (a) Compile the HDL file fun_text with the synthesis style (Assign →
Global Project Logic Synthesis) Fast and compiler option Processing → Timing
SNF Extractor.

(b) Use the waveform file fun_text.snf to check Setup/Hold, Check Outputs,
Glitch, and Oscillation. Set the period of the clock signal to

(b1) 50 ns.

(b2) 20 ns.

(b3) 15 ns.

(b4) 10 ns.

1.6: (a) Open the file fun_text.scf and start the simulation.

(b) Select the simulator window with the top menu line labelled Initialize.
Select Initialize Memory and export the ROM table in Intel HEX format as
sine.hex.

(c) Change the fun_text HDL file so that it uses the Intel HEX file sine.hex
for the ROM table, and verify the correct results through a simulation.

Fig. 1.20. Waveform file for Example 1.1 on p. 14.



1.7: (a) Design a 16-bit adder using the LPM_ADD_SUB macro with the MaxPlusII 
software. 

(b) Measure the Registered Performance and compare the result with the data 
from Example 1.2 (p. 20). 

1.8: (a) Design the PREP benchmark 5 shown in Fig. 1.21a with the MaxPlusII
software. The design has a 4 x 4 unsigned array multiplier followed by an
8-bit accumulator. If MAC = TRUE accumulation is performed; otherwise S gets
the multiplier output. rst is an asynchronous reset and the 8-bit register
is positive-edge triggered via clk; see the simulation in Fig. 1.21c for the
function test.

Fig. 1.21. PREP benchmark 5. (a) Single design, (b) Multiple instantiation,
(c) Test bench to check function.

(b) Measure the size and Registered Performance for a single copy. Select
Global Project Logic Synthesis under the Assign menu. Try the following 4
different synthesis styles: Fast or Normal and for Optimize try Speed=0 or
Speed=10. Which synthesis options are optimal for size or Registered
Performance?

Select one of the following devices:

(b1) EPF10K20RC240-4.

(b2) EPF10K70RC240-4.

(b3) EPM7128LC84-7.

(c) Design the multiple instantiation for benchmark 5 as shown in Fig. 1.21b.

(d) Measure the size and Registered Performance for the design with the
maximum number of instantiations of PREP benchmark 5. Use the optimal
synthesis option you found in (b) for the following devices:

(d1) EPF10K20RC240-4.

(d2) EPF10K70RC240-4.

(d3) EPM7128LC84-7.

1.9: (a) Design the PREP benchmark 6 shown in Fig. 1.22a with the MaxPlusII
software. The design has a 16-bit accumulator that is positive-edge triggered
via clk and an asynchronous reset rst; see the simulation in Fig. 1.22c for
the function test.

(b) Measure the size and Registered Performance for a single copy. Select
Global Project Logic Synthesis under the Assign menu. Try the following 4
different synthesis styles: Fast or Normal and for Optimize try Speed=0 or
Speed=10. Which synthesis options are optimal for size or Registered
Performance?

Select one of the following devices:

(b1) EPF10K20RC240-4.

(b2) EPF10K70RC240-4.

(b3) EPM7128LC84-7.

(c) Design the multiple instantiation for benchmark 6 as shown in Fig. 1.22b.

(d) Measure the size and Registered Performance for the design with the
maximum number of instantiations of PREP benchmark 6. Use the optimal
synthesis option you found in (b) for the following devices:

(d1) EPF10K20RC240-4.

(d2) EPF10K70RC240-4.

(d3) EPM7128LC84-7.

Fig. 1.22. PREP benchmark 6. (a) Single design, (b) Multiple instantiation,
(c) Test bench to check function.

1.10: Use the MaxPlusII software and write two different codes using the
structural (use only one- or two-input basic gates, i.e., NOT, AND, and OR)
and behavioral HDL styles for:

(a) A 2:1 multiplexer.

(b) An XNOR gate.

(c) A half-adder.

(d) A 2:4 decoder (demultiplexer).

Note for VHDL designs: Use the a_74xx Altera component for the structural
design files. In the Altera Help you find these components, called SSI
Functions, under "Old Style Macrofunctions." Because a component identifier
cannot start with a number, Altera has added the a_ in front of each 74
series component. In order to find the names and data types for input and
output ports you need to check the library file
vhdl93\altera\maxplus2.vhd. You will find that the library uses the
STD_LOGIC data type and the names for the ports are a_1, a_2, and a_3 (if
needed).






2. Computer Arithmetic 



2.1 Introduction 

In computer arithmetic two fundamental design principles are of great impor- 
tance: number representation and the implementation of algebraic operations 
[21, 22, 23, 24, 25]. We will first discuss possible number representations, 
(e.g., fixed-point or floating-point), then basic operations like adder and mul- 
tiplier, and finally efficient implementation of more difficult operations such 
as square roots, and the computation of trigonometric functions using the 
CORDIC algorithm. 

FPGAs allow a wide variety of computer arithmetic implementations for 
the desired digital signal processing algorithms, because of the physical bit- 
level programming architecture. This contrasts with the programmable dig- 
ital signal processors (PDSPs), with the fixed multiply accumulator core. 
Careful choice of the bit width in FPGA design can result in substantial 
savings. 



Fig. 2.1. Survey of number representations.

2.2 Number Representation

Deciding whether fixed- or floating-point is more appropriate for the problem 
must be done carefully, preferably at an early phase in the project. In general, 
it can be assumed that fixed-point implementations have higher speed and 
lower cost, while floating-point has higher dynamic range and no need for 
scaling, which may be attractive for more complicated algorithms. Figure 2.1 
is a survey of conventional and less conventional fixed- and floating-point 
number representations. Both systems are covered by a number of standards 
but may, if desired, be implemented in a proprietary form. 

2.2.1 Fixed-Point Numbers 

We will first review the fixed-point number systems shown in Fig. 2.1.
Table 2.1 shows the 3-bit coding for the 5 different integer representations.



Unsigned Integer 

Let $X$ be an $N$-bit unsigned binary number. Then the range is
$[0, 2^N - 1]$ and the representation is given by:

$$X = \sum_{n=0}^{N-1} x_n 2^n, \qquad (2.1)$$

where $x_n$ is the $n$th binary digit of $X$ (i.e., $x_n \in \{0, 1\}$). The
digit $x_0$ is called the least significant bit (LSB) and has a relative
weight of unity. The digit $x_{N-1}$ is the most significant bit (MSB) and
has a relative weight of $2^{N-1}$.



Signed-Magnitude (SM) 



In signed-magnitude systems the magnitude and the sign are represented
separately. The first bit $x_{N-1}$ (i.e., the MSB) represents the sign and
the remaining $N - 1$ bits the magnitude. The representation becomes:

$$X = \begin{cases} \sum_{n=0}^{N-2} x_n 2^n & X \geq 0 \\ -\sum_{n=0}^{N-2} x_n 2^n & X < 0. \end{cases} \qquad (2.2)$$

The range of this representation is $[-(2^{N-1} - 1), 2^{N-1} - 1]$. The
advantage of the signed-magnitude representation is simplified prevention of
overflows, but the disadvantage is that addition must be split depending on
which operand is larger.

Two's Complement (2C)

An $N$-bit two's complement representation of a signed integer, over the
range $[-2^{N-1}, 2^{N-1} - 1]$, is given by:

$$X = \begin{cases} \sum_{n=0}^{N-2} x_n 2^n & X \geq 0 \\ -2^{N-1} + \sum_{n=0}^{N-2} x_n 2^n & X < 0. \end{cases} \qquad (2.3)$$

The two's complement (2C) system is by far the most popular signed
numbering system in DSP use today. This is because it is possible to add
several signed numbers, and as long as the final sum is in the $N$-bit range,
we can ignore any overflow in the arithmetic. For instance, if we add two
3-bit numbers as follows

    $3_{10}$    ↔      $011_{2C}$
    $-2_{10}$   ↔      $110_{2C}$
    $1_{10}$    ↔    $1.001_{2C}$

the overflow can be ignored. All computations are modulo $2^N$. It follows
that it is possible to have intermediate values that cannot be correctly
represented, but if the final value is valid then the result is correct. For
instance, if we add the 3-bit numbers 2 + 2 - 3, we would have an
intermediate value of $010 + 010 = 100_{2C}$, i.e., $-4_{10}$, but the result
$100 - 011 = 100 + 101 = 001_{2C}$ is correct.

Two's complement numbers can also be used to implement modulo $2^N$
arithmetic without any change in the arithmetic. This is what we will use in
Chap. 5 to design CIC filters.
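
The wrap-around behavior can be tried out with a small Python model of 3-bit
two's complement arithmetic. The helper names to_2c and from_2c are ours, and
the sketch simply repeats the 2 + 2 - 3 example from the text.

```python
N = 3
MASK = (1 << N) - 1

def to_2c(x):
    """Encode an integer as an N-bit two's complement code (modulo 2**N)."""
    return x & MASK

def from_2c(code):
    """Decode an N-bit two's complement code to a signed integer."""
    return code - (1 << N) if code & (1 << (N - 1)) else code

# 2 + 2 overflows the 3-bit range, but the final sum 2 + 2 - 3 is valid again.
intermediate = (to_2c(2) + to_2c(2)) & MASK
result = (intermediate + to_2c(-3)) & MASK

print(format(intermediate, "03b"), from_2c(intermediate))
print(format(result, "03b"), from_2c(result))
```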



One’s Complement (1C) 

An $N$-bit one's complement system (1C) can represent integers over the range
$[-(2^{N-1} - 1), 2^{N-1} - 1]$. In a one's complement code, a negative
number is obtained by complementing all bits of the corresponding positive
number. There is, in fact, a redundant representation of zero (see Table
2.1). The representation of signed numbers in a 1C system is formally given
by:

$$X = \begin{cases} \sum_{n=0}^{N-2} x_n 2^n & X \geq 0 \\ -2^{N-1} + 1 + \sum_{n=0}^{N-2} x_n 2^n & X \leq 0. \end{cases} \qquad (2.4)$$

For example, the three-bit 1C representation of the numbers -3 to 3 is shown
in the third column of Table 2.1.

From the following simple example

    $3_{10}$    ↔      $011_{1C}$
    $-2_{10}$   ↔      $101_{1C}$
    $1_{10}$    ↔    $1.000_{1C}$
    Carry       →        $1_{1C}$
    $1_{10}$    ↔      $001_{1C}$

we remember that in one's complement a "carry wrap-around" addition is
needed. A carry occurring at the MSB must be added to the LSB to get the
correct final result.

The system can, however, efficiently be used to implement modulo $2^N - 1$
arithmetic without correction. As a result, one's complement has specialized
value in implementing selected DSP algorithms (e.g., Mersenne transforms
over the integer ring $2^N - 1$; see Chap. 7).
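
The carry wrap-around can be modeled in a few lines of Python. The helper
names below are ours; the sketch encodes 3-bit one's complement numbers, adds
them with an end-around carry, and repeats the 3 + (-2) example.

```python
N = 3
MASK = (1 << N) - 1

def to_1c(x):
    """Encode x in N-bit one's complement; negatives invert all bits."""
    return x if x >= 0 else (~(-x)) & MASK

def from_1c(code):
    """Decode an N-bit one's complement code (both zeros decode to 0)."""
    return code if code < (1 << (N - 1)) else -((~code) & MASK)

def add_1c(a, b):
    """Add two codes; a carry out of the MSB is added back to the LSB."""
    t = a + b
    return ((t & MASK) + (t >> N)) & MASK

s = add_1c(to_1c(3), to_1c(-2))
print(format(to_1c(3), "03b"), "+", format(to_1c(-2), "03b"), "=", format(s, "03b"))
print(from_1c(s))
```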



Diminished One System (D1)

A diminished one (D1) system is a biased system. The positive numbers are,
compared with the 2C, diminished by 1. The range for $(N+1)$-bit D1 numbers
is $[-2^{N-1}, 2^{N-1}]$, excluding 0. The coding rule for a D1 system is
defined as follows:

$$X = \begin{cases} \sum_{n=0}^{N-2} x_n 2^n + 1 & X > 0 \\ -2^{N-1} + \sum_{n=0}^{N-2} x_n 2^n & X < 0 \\ 2^N & X = 0. \end{cases} \qquad (2.5)$$

From adding two D1 numbers

    $3_{10}$         ↔      $010_{D1}$
    $-2_{10}$        ↔      $110_{D1}$
    $1_{10}$         ↔    $1.000_{D1}$
    Inverted carry   →        $0_{D1}$
    $1_{10}$         ↔      $000_{D1}$

we see that, for D1, a complement and add of the inverted carry must be
computed.

D1 numbers can efficiently be used to implement modulo $2^N + 1$ arithmetic
without any change in the arithmetic. This fact will be used in Chap. 7 to
implement Fermat NTTs in the ring $2^N + 1$.
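
The "add the inverted carry" rule can be checked with the following Python
sketch for the 3-bit codes of Table 2.1. The helper names are ours, and the
sketch is deliberately restricted: it does not model the extra zero-flag bit,
so zero operands and zero results are rejected.

```python
N = 3
MASK = (1 << N) - 1

def to_d1(x):
    """Encode a nonzero integer in [-2**(N-1), 2**(N-1)] as an N-bit D1 code."""
    if x == 0:
        raise ValueError("zero needs the extra zero-flag bit, not modeled here")
    return (x - 1) if x > 0 else (x & MASK)

def from_d1(code):
    """Decode an N-bit D1 code (zero flag not modeled)."""
    return code + 1 if code < (1 << (N - 1)) else code - (1 << N)

def add_d1(a, b):
    """Add two D1 codes: add the codes, then add the inverted carry-out."""
    t = a + b
    if t == MASK:
        raise ValueError("sum is zero, which needs the zero-flag bit")
    carry = t >> N
    return (t + (1 - carry)) & MASK

s = add_d1(to_d1(3), to_d1(-2))
print(format(to_d1(3), "03b"), "+", format(to_d1(-2), "03b"), "=", format(s, "03b"))
print(from_d1(s))
```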



Bias System 

The biased number system has a bias for all numbers. The bias value is
usually in the middle of the binary range, i.e., bias $= 2^{N-1} - 1$. For a
3-bit system, for instance, the bias would be $2^{3-1} - 1 = 3$. The range
for $N$-bit biased numbers is $[-(2^{N-1} - 1), 2^{N-1}]$. Zero is coded as
the bias. The coding rule for a biased system is defined as follows:

$$X = \sum_{n=0}^{N-1} x_n 2^n - \text{bias}. \qquad (2.6)$$

From adding two biased numbers

    $3_{10}$        ↔    $110_{bias}$
    $+(-2_{10})$    ↔    $001_{bias}$
    $4_{10}$        ↔    $111_{bias}$
    $-$bias         ↔    $011_{bias}$
    $1_{10}$        ↔    $100_{bias}$

we see that, for each addition the bias needs to be subtracted, while for
every subtraction the bias needs to be added.

Bias numbers can efficiently be used to simplify comparison of numbers.
This fact will be used in Sect. 2.2.3 (p. 47) for coding the exponent of
floating-point numbers.

Table 2.1. Conventional coding of signed binary numbers.

Binary    2C    1C    D1    SM    Bias
011        3     3     4     3     0
010        2     2     3     2    -1
001        1     1     2     1    -2
000        0     0     1     0    -3
111       -1    -0    -1    -3     4
110       -2    -1    -2    -2     3
101       -3    -2    -3    -1     2
100       -4    -3    -4    -0     1
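
The bias correction and the simplified comparison can both be seen in a short
Python model of the 3-bit biased code. The function names are ours; the
sketch repeats the 3 + (-2) example and shows that sorting by code word gives
the same order as sorting by value.

```python
N = 3
BIAS = (1 << (N - 1)) - 1          # 3 for a 3-bit system

def to_bias(x):
    """Encode x in [-(2**(N-1) - 1), 2**(N-1)] as an unsigned biased code."""
    return x + BIAS

def from_bias(code):
    """Decode a biased code."""
    return code - BIAS

def add_bias(a, b):
    """Add two biased codes; the bias counted twice is subtracted once."""
    return a + b - BIAS

def sub_bias(a, b):
    """Subtract two biased codes; the cancelled bias is added back."""
    return a - b + BIAS

s = add_bias(to_bias(3), to_bias(-2))
print(format(to_bias(3), "03b"), "+", format(to_bias(-2), "03b"), "->", format(s, "03b"))
print(from_bias(s))

# Comparison: the unsigned order of the codes equals the order of the values.
values = [2, -3, 4, 0, -1]
print(sorted(values, key=to_bias) == sorted(values))
```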

2.2.2 Unconventional Fixed-Point Numbers 

In the following we continue the review of number systems according to
Fig. 2.1 (p. 31). The unconventional fixed-point number systems discussed
below are used less often than, for instance, the 2C system, but can
yield significant improvements for particular applications or problems.



Signed Digit Numbers (SD) 

The signed digit (SD) system differs from the traditional binary systems
presented in the previous section in the fact that it is ternary valued (i.e.,
digits have the value {0, 1, -1}, where -1 is sometimes denoted as
$\bar{1}$).

SD numbers have proven to be useful in carry-free adders or multipliers
with less complexity, because the effort in multiplication can typically be
estimated through the number of nonzero elements, which can be reduced
by using SD numbers. Statistically, half the digits in the two's complement
coding of a number are zero. For an SD code, the density of zeros increases
to two thirds as the following example shows:

Example 2.1: SD coding

Consider coding the decimal number $15 = 1111_2$ using a 5-bit binary and an
SD code. Their representations are as follows:

1) $15_{10} = 16_{10} - 1_{10} = 1000\bar{1}_{SD}$.

2) $15_{10} = 16_{10} - 2_{10} + 1_{10} = 100\bar{1}1_{SD}$.

3) $15_{10} = 16_{10} - 4_{10} + 3_{10} = 10\bar{1}11_{SD}$.

4) etc.

| 2.1 |

The SD representation, unlike a 2C code, is nonunique. We call a canonic
signed digit system, or CSD, the system with the minimum number of nonzero
elements. The following algorithm can be used to produce a "classical"
CSD code.

Algorithm 2.2: Classical CSD Coding

Starting with the LSB, substitute all sequences of ones of length two or
more with $10 \ldots 0\bar{1}$.

This CSD coding is the basis for the C utility program csd.exe on the
CD-ROM. This classical CSD code is also unique, and an additional property is
that the resulting representation has at least one zero between any two
nonzero digits; each digit may have the value 1, $\bar{1}$, or 0.

Example 2.3: Classical CSD Code

Consider again coding the decimal number 15 using a 5-bit binary and a CSD
code. Their representations are: $1111_2 = 1000\bar{1}_{CSD}$. We notice from
a comparison with the SD coding from Example 2.1 that only the first
representation is a CSD code.

As another example consider the coding of

$$27_{10} = 11011_2 = 1110\bar{1}_{SD} = 100\bar{1}0\bar{1}_{CSD}. \qquad (2.7)$$

We note that although the first substitution of $011 \to 10\bar{1}$ does not
reduce the complexity, it produces a run of length three, and the complexity
reduces from three additions to two subtractions. | 2.3 |
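
A CSD code can also be computed arithmetically. The Python sketch below does
not implement the string substitution of Algorithm 2.2 or the csd.exe
utility; it uses the standard non-adjacent form recoding, which examines the
two lowest bits of the number at each step and yields a code with no two
adjacent nonzero digits. The function names are ours, and -1 stands for the
digit $\bar{1}$.

```python
def csd(n):
    """Return the CSD digits of a nonnegative integer, MSB first.

    Each digit is -1, 0, or 1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)    # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d             # n is now divisible by 4 or ends in 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]

def sd_value(digits):
    """Evaluate a signed digit vector given MSB first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

for n in (15, 27, 3):
    print(n, csd(n))
```

For 15 and 27 the digits agree with the codes given in Example 2.3.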



On the other hand, the classical CSD coding does not always produce the
“optimal” CSD coding in terms of hardware complexity, because in Algorithm
2.2 additions are also substituted by subtractions, when there should be no
such substitution. For instance, $011_2$ is coded as $10\bar{1}_{\text{CSD}}$, and if this coding is
used to produce a constant multiplier the subtraction will need a full-adder
instead of a half-adder for the LSB. The CSD coding given in the following
will produce a CSD coding with the minimum number of nonzero terms, but
also with the minimum number of subtractions.

Algorithm 2.4: Optimal CSD Coding

1) Starting with the LSB, substitute all sequences of ones longer than two with
$10\ldots0\bar{1}$. Also substitute 1011 with $110\bar{1}$.

2) Starting with the MSB, substitute $10\bar{1}$ with 011.



Carry-free Adder

The SD number representation can be used to implement a carry-free adder.
Takagi et al. [26] introduced the scheme presented in Table 2.2. Here, $u_k$ is
the interim sum and $c_k$ is the carry of the $k$th bit (i.e., to be added to $u_{k+1}$).

Table 2.2. Adding carry-free binaries using SD representation.

$x_k y_k$         | 00 | 01                   | 01                        | $0\bar{1}$           | $0\bar{1}$                | 11 | $\bar{1}\bar{1}$
$x_{k-1} y_{k-1}$ | -  | neither is $\bar{1}$ | at least one is $\bar{1}$ | neither is $\bar{1}$ | at least one is $\bar{1}$ | -  | -
$c_k$             | 0  | 1                    | 0                         | 0                    | $\bar{1}$                 | 1  | $\bar{1}$
$u_k$             | 0  | $\bar{1}$            | 1                         | $\bar{1}$            | 1                         | 0  | 0

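Example 2.5: Carry-free Addition

The addition of 29 to -9 in the SD system is performed below.

$$\begin{array}{rcccccccl}
  &   & 1 & 0 & 0 & \bar{1} & 0 & 1 & x_k \\
+ &   & 0 & \bar{1} & 1 & \bar{1} & 1 & 1 & y_k \\ \hline
  & 0 & 0 & 0 & \bar{1} & 1 & 1 &   & c_k \\
  &   & 1 & \bar{1} & 1 & 0 & \bar{1} & 0 & u_k \\ \hline
  &   & 1 & \bar{1} & 0 & 1 & 0 & 0 & s_k
\end{array}$$

| 2.5 |

However, due to the ternary logic burden, implementing Table 2.2 with
FPGAs requires 4-input operands for the $c_k$ and $u_k$. This translates into a
$2^8 \times 4$-bit LUT when implementing Table 2.2.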
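
The rules of Table 2.2 can be checked with a short behavioral model. The following Python sketch is an illustration only (the function name `sd_add` and the list-based digit format are assumptions of this example): digits are given MSB first, -1 stands for $\bar{1}$, and each interim sum $u_k$ and carry $c_k$ depends only on the digits at positions $k$ and $k-1$, so no carry ripples.

```python
def sd_add(x, y):
    """Carry-free addition of two equal-length signed-digit numbers.

    x and y are MSB-first lists with digits in {-1, 0, 1}. The result
    has one extra leading digit. Behavioral sketch of Table 2.2.
    """
    assert len(x) == len(y)
    n = len(x)
    xs, ys = x[::-1], y[::-1]             # work LSB first
    c = [0] * n                            # carries c_k
    u = [0] * n                            # interim sums u_k
    for k in range(n):
        t = xs[k] + ys[k]
        # "at least one is 1-bar" among the next lower digit pair
        lower_neg = k > 0 and (xs[k - 1] == -1 or ys[k - 1] == -1)
        if t == 2:
            c[k], u[k] = 1, 0
        elif t == -2:
            c[k], u[k] = -1, 0
        elif t == 1:
            c[k], u[k] = (0, 1) if lower_neg else (1, -1)
        elif t == -1:
            c[k], u[k] = (-1, 1) if lower_neg else (0, -1)
        # t == 0 keeps c_k = u_k = 0
    # s_k = u_k + c_{k-1}; the final carry becomes the new MSB
    s = [u[0]] + [u[k] + c[k - 1] for k in range(1, n)] + [c[n - 1]]
    return s[::-1]


x = [1, 0, 0, -1, 0, 1]      # 29
y = [0, -1, 1, -1, 1, 1]     # -9
print(sd_add(x, y))          # signed digits of the sum 20
```

Applied to the operands of Example 2.5, the model returns the digits $1\bar{1}0100$ (with a leading zero), i.e., 20.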



Multiplier Adder Graph (MAG) 

We have seen that the cost of multiplication is a direct function of the number
of nonzero elements $a_k$ in $A$. The CSD system minimizes this cost. The CSD is
also the basis for the Booth multiplier [21] discussed in Exercise 2.2 (p. 104).

It can, however, sometimes be more efficient first to factor the coefficient
into several factors, and realize the individual factors in an optimal CSD sense
[27, 28, 29, 30]. Figure 2.2 illustrates this option for the coefficient 93. The
direct binary and CSD codes are given by $93_{10} = 1011101_2 = 1100\bar{1}01_{\text{CSD}}$,
with the 2C requiring four adders, and the CSD requiring three adders. The
coefficient 93 can also be represented as $93 = 3 \times 31$, which requires one adder
for each factor (see Fig. 2.2). The cost of the factored form is therefore reduced
to two adders. There are several ways to combine these different factors. The
number of adders required is often referred to as the cost of the constant
coefficient multiplier. Figure 2.3, suggested by Dempster et al. [29], shows all
possible configurations for one to four adders. Using this graph, all coefficients
with a cost ranging from one to four can be synthesized with $k_i \in \mathbb{N}_0$,
according to:

Cost 1: 1) $A = 2^{k_0}(2^{k_1} \pm 2^{k_2})$

Cost 2: 1) $A = 2^{k_0}(2^{k_1} \pm 2^{k_2} \pm 2^{k_3})$

        2) $A = 2^{k_0}(2^{k_1} \pm 2^{k_2})(2^{k_3} \pm 2^{k_4})$

Cost 3: 1) $A = 2^{k_0}(2^{k_1} \pm 2^{k_2} \pm 2^{k_3} \pm 2^{k_4})$

Using this technique, Table 2.3 shows the optimal coding for all eight-bit
integers having a cost between zero and three [5].
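
The adder counts quoted for the coefficient 93 can be reproduced with a small Python sketch. It is a hedged illustration, not the MAG search of Dempster et al.: the cost of a single CSD-coded constant is taken as the number of nonzero digits minus one, and the helper names are hypothetical.

```python
def csd_nonzeros(n):
    """Count nonzero digits in the canonic signed-digit code of n > 0."""
    count = 0
    while n:
        if n & 1:
            n -= 2 - (n & 3)     # remove a +1 or -1 digit
            count += 1
        n >>= 1
    return count


def binary_adder_cost(n):
    """Adders needed when every one-bit of n contributes a shifted term."""
    return bin(n).count("1") - 1


def csd_adder_cost(n):
    """Adders/subtractors needed for the CSD coding of n."""
    return csd_nonzeros(n) - 1


print(binary_adder_cost(93))                    # direct binary coding
print(csd_adder_cost(93))                       # CSD coding
print(csd_adder_cost(3) + csd_adder_cost(31))   # factored as 3 x 31
```

The three printed values correspond to the four, three, and two adders discussed above; a full MAG search would in addition explore sums of partial products, which this sketch does not attempt.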



Logarithmic Number System (LNS) 

The logarithmic number system (LNS) [31, 32] is analogous to the
floating-point system with a fixed mantissa and a fractional exponent. In the
LNS, a number $x$ is represented as:

$$x = \pm r^{\pm e_x}, \qquad (2.8)$$

where $r$ is the system's radix, and $e_x$ is the LNS exponent. The LNS format
consists of a sign bit each for the number and the exponent, and an exponent
assigned $I$ integer bits and $F$ fractional bits of precision. The format in
graphical form is shown below:

| Sign $S_x$ | Exponent sign $S_e$ | Exponent integer bits $I$ | Exponent fractional bits $F$ |




Fig. 2.2. Two realizations for the constant factor 93.

Fig. 2.3. Possible cost one to four graphs. Each node is either an adder or subtractor
and each edge is associated with a power-of-two factor (©1995 IEEE [29]).



Table 2.3. Cost C (i.e., number of adders) for all eight-bit numbers using the
multiplier adder graph (MAG) technique.

C | Coefficient
--+---------------------------------------------------------------------------
0 | 1, 2, 4, 8, 16, 32, 64, 128, 256
1 | 3, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 24, 28, 30, 31, 33, 34, 36, 40,
  | 48, 56, 60, 62, 63, 65, 66, 68, 72, 80, 96, 112, 120, 124, 126, 127, 129,
  | 130, 132, 136, 144, 160, 192, 224, 240, 248, 252, 254, 255
2 | 11, 13, 19, 21, 22, 23, 25, 26, 27, 29, 35, 37, 38, 39, 41, 42, 44, 46, 47,
  | 49, 50, 52, 54, 55, 57, 58, 59, 61, 67, 69, 70, 71, 73, 74, 76, 78, 79, 81,
  | 82, 84, 88, 92, 94, 95, 97, 98, 100, 104, 108, 110, 111, 113, 114, 116,
  | 118, 119, 121, 122, 123, 125, 131, 133, 134, 135, 137, 138, 140, 142, 143,
  | 145, 146, 148, 152, 156, 158, 159, 161, 162, 164, 168, 176, 184, 188, 190,
  | 191, 193, 194, 196, 200, 208, 216, 220, 222, 223, 225, 226, 228, 232, 236,
  | 238, 239, 241, 242, 244, 246, 247, 249, 250, 251, 253
3 | 43, 45, 51, 53, 75, 77, 83, 85, 86, 87, 89, 90, 91, 93, 99, 101, 102, 103,
  | 105, 106, 107, 109, 115, 117, 139, 141, 147, 149, 150, 151, 153, 154, 155,
  | 157, 163, 165, 166, 167, 169, 170, 172, 174, 175, 177, 178, 180, 182, 183,
  | 185, 186, 187, 189, 195, 197, 198, 199, 201, 202, 204, 206, 207, 209, 210,
  | 212, 214, 215, 217, 218, 219, 221, 227, 229, 230, 231, 233, 234, 235, 237,
  | 243, 245
4 | 171, 173, 179, 181, 203, 205, 211, 213
--+---------------------------------------------------------------------------
  | Minimum costs through factorization
--+---------------------------------------------------------------------------
2 | 45 = 5 x 9, 51 = 3 x 17, 75 = 5 x 15, 85 = 5 x 17, 90 = 2 x 9 x 5,
  | 93 = 3 x 31, 99 = 3 x 33, 102 = 2 x 3 x 17, 105 = 7 x 15,
  | 150 = 2 x 5 x 15, 153 = 9 x 17, 155 = 5 x 31, 165 = 5 x 33,
  | 170 = 2 x 5 x 17, 180 = 4 x 5 x 9, 186 = 2 x 3 x 31, 189 = 3 x 63,
  | 195 = 3 x 65, 198 = 2 x 3 x 33, 204 = 4 x 3 x 17, 210 = 2 x 7 x 15,
  | 217 = 7 x 31, 231 = 7 x 33
3 | 171 = 3 x 57, 173 = 8 + 165, 179 = 51 + 128, 181 = 1 + 180,
  | 211 = 1 + 210, 213 = 3 x 71, 205 = 5 x 41, 203 = 7 x 29

The LNS, like floating-point, carries a nonuniform precision. Small values of
x are highly resolved, while large values of x are more coarsely resolved as
the following example shows.

Example 2.6: LNS Coding

Consider a radix-2 9-bit LNS word with two sign-bits, three bits for
integer precision and four-bit fractional precision. How can, for instance, the
LNS coding 00 011.0010 be translated into the real number system? The
two sign bits indicate that the whole number and the exponent are positive.
The integer part is 3 and the fractional part $2^{-3} = 1/8$. The real
number representation is therefore $2^{3+1/8} = 2^{3.125} = 8.724$. We find also that
$-2^{3.125}$ = 10 011.0010 and $2^{-3.125}$ = 01 100.1110. The largest number that
can be represented with this 9-bit LNS format is $2^{8-1/16} \approx 2^8 = 256$ and
the smallest is $2^{-8} = 0.0039$, as graphically interpreted in Fig. 2.4a. In
contrast, an 8-bit plus sign fixed-point number has a maximal positive value of
$2^8 - 1 = 255$, and the smallest nonzero positive value is one. A comparison
of the two 9-bit systems is shown in Fig. 2.4b. | 2.6 |
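
The decoding steps of Example 2.6 can be sketched in Python. The function name `lns_decode` is hypothetical, and the sketch assumes, consistent with the three codings quoted in the example, that the exponent sign bit together with the seven exponent bits forms an 8-bit two's complement value with four fractional bits.

```python
def lns_decode(word):
    """Decode a 9-bit radix-2 LNS word given as a string such as '00 011.0010'.

    Assumed layout: number sign, then an 8-bit two's complement exponent
    (sign, 3 integer bits, 4 fractional bits).
    """
    bits = word.replace(" ", "").replace(".", "")
    if len(bits) != 9 or set(bits) - {"0", "1"}:
        raise ValueError("expected 9 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:], 2)
    if bits[1] == "1":                 # negative exponent in two's complement
        exponent -= 1 << 8
    return sign * 2.0 ** (exponent / 16.0)   # four fractional exponent bits


for code in ("00 011.0010", "10 011.0010", "01 100.1110"):
    print(code, lns_decode(code))
```

The first coding evaluates to roughly 8.724, the second to its negative, and the third to $2^{-3.125}$.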



Fig. 2.4. LNS processing. (a) Values. (b) Resolution.

The historical attraction of the LNS lies in its ability to efficiently
implement multiplication, division, square-rooting, or squaring. For example, the
product $C = A \times B$, where $A$, $B$, and $C$ are LNS words, is given by:

$$C = r^{e_a} \times r^{e_b} = r^{e_a + e_b} = r^{e_c}. \qquad (2.9)$$

That is, the exponent of the LNS product is simply the sum of the two
exponents. Division and high-order operations immediately follow. Unfortunately,
addition or subtraction are by comparison far more complex. Addition and
subtraction operations are based on the following procedure, where it is
assumed that $A > B$.

$$C = A + B = 2^{e_a} + 2^{e_b} = 2^{e_a} \underbrace{\left(1 + 2^{e_b - e_a}\right)}_{\Phi^{+}(\Delta)} = 2^{e_c}. \qquad (2.10)$$

Solving for the exponent $e_c$, one obtains $e_c = e_a + \phi^{+}(\Delta)$ where
$\Delta = e_b - e_a$ and $\phi^{+}(u) = \log_2(\Phi^{+}(\Delta))$. For subtraction a similar table,
$\phi^{-}(u) = \log_2(\Phi^{-}(\Delta))$, $\Phi^{-}(\Delta) = (1 - 2^{e_b - e_a})$, can be used. Such tables have been
historically used for rational numbers as described in “Logarithmorum
Completus,” Jurij Vega (1754-1802), containing tables computed by Zech. As a
result, the term $\log_2(1 - 2^{u})$ is usually referred to as a Zech logarithm.

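LNS arithmetic is performed in the following manner [31]. Let $A = 2^{e_a}$,
$B = 2^{e_b}$, $C = 2^{e_c}$, with $S_A$, $S_B$, $S_C$ denoting the sign-bit for each word:

Operation   |                | Action
------------+----------------+-----------------------------------------------
Multiply    | $C = A \times B$ | $e_c = e_a + e_b$; $S_C = S_A$ XOR $S_B$
Divide      | $C = A / B$    | $e_c = e_a - e_b$; $S_C = S_A$ XOR $S_B$
Add         | $C = A + B$    | $e_c = e_a + \phi^{+}(e_b - e_a)$ if $A \geq B$
            |                | $e_c = e_b + \phi^{+}(e_a - e_b)$ if $B > A$
Subtract    | $C = A - B$    | $e_c = e_a + \phi^{-}(e_b - e_a)$ if $A \geq B$
            |                | $e_c = e_b + \phi^{-}(e_a - e_b)$ if $B > A$
Square root | $C = \sqrt{A}$ | $e_c = e_a / 2$
Square      | $C = A^2$      | $e_c = 2 e_a$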
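
The table above can be exercised with a floating-point model of the exponent arithmetic. The following Python sketch is illustrative only: it evaluates $\phi^{+}$ directly with `math.log2` rather than reading a lookup table as a hardware design would, it handles positive operands only, and the function names are assumptions of this example.

```python
import math


def phi_plus(u):
    """phi+(u) = log2(1 + 2**u), evaluated directly instead of by table."""
    return math.log2(1.0 + 2.0 ** u)


def lns_multiply(e_a, e_b):
    """Exponent of A x B: the exponents simply add."""
    return e_a + e_b


def lns_add(e_a, e_b):
    """Exponent of A + B for positive A and B."""
    if e_a < e_b:                      # make e_a the larger exponent
        e_a, e_b = e_b, e_a
    return e_a + phi_plus(e_b - e_a)


e5, e3 = math.log2(5.0), math.log2(3.0)
print(2.0 ** lns_multiply(e5, e3))   # close to 15
print(2.0 ** lns_add(e5, e3))        # close to 8
```

The product needs one addition, whereas the sum needs a comparison, a subtraction, an evaluation of $\phi^{+}$, and an addition.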



Methods have been developed to reduce the necessary table size for the
Zech logarithm by using partial tables [31] or using linear interpolation
techniques [33]. These techniques are beyond the scope of the discussion
presented here.

Residue Number System (RNS)

The RNS is actually an ancient algebraic system whose history can be traced
back 2000 years. The RNS is an integer arithmetic system in which the
primitive operations of addition, subtraction, and multiplication are defined. The
primitive operations are performed concurrently within noncommunicating
small-wordlength channels [34, 35]. An RNS system is defined with respect
to a positive integer basis set $\{m_1, m_2, \ldots, m_L\}$, where the $m_l$ are all
relatively (pairwise) prime. The dynamic range of the resulting system is $M$,
where $M = \prod_{l=1}^{L} m_l$. For signed-number applications, the integer value of
$X$ is assumed to be constrained to $X \in [-M/2, M/2)$. RNS arithmetic is
defined within a ring isomorphism:

$$\mathbb{Z}_M \cong \mathbb{Z}_{m_1} \times \mathbb{Z}_{m_2} \times \cdots \times \mathbb{Z}_{m_L}, \qquad (2.11)$$

where $\mathbb{Z}_M = \mathbb{Z}/(M)$ corresponds to the ring of integers modulo $M$, called
the residue class mod $M$. The mapping of an integer $X$ into an RNS $L$-tuple
$X \leftrightarrow (x_1, x_2, \ldots, x_L)$ is defined by $x_l = X \bmod m_l$, for $l = 1, 2, \ldots, L$.
Defining $\square$ to be the algebraic operations $+$, $-$, or $*$, it follows that if
$Z, X, Y \in \mathbb{Z}_M$, then:

$$Z = X \,\square\, Y \bmod M \qquad (2.12)$$

is isomorphic to $Z \leftrightarrow (z_1, z_2, \ldots, z_L)$. Specifically:

$$\begin{aligned}
X &\overset{(m_1, m_2, \ldots, m_L)}{\longleftrightarrow} \left(\langle X \rangle_{m_1}, \langle X \rangle_{m_2}, \ldots, \langle X \rangle_{m_L}\right) \\
Y &\overset{(m_1, m_2, \ldots, m_L)}{\longleftrightarrow} \left(\langle Y \rangle_{m_1}, \langle Y \rangle_{m_2}, \ldots, \langle Y \rangle_{m_L}\right) \\
Z = X \,\square\, Y &\overset{(m_1, m_2, \ldots, m_L)}{\longleftrightarrow} \left(\langle X \,\square\, Y \rangle_{m_1}, \langle X \,\square\, Y \rangle_{m_2}, \ldots, \langle X \,\square\, Y \rangle_{m_L}\right).
\end{aligned}$$

As a result, RNS arithmetic is pairwise defined. The $L$ elements of
$Z = (X \,\square\, Y) \bmod M$ are computed concurrently within $L$ small wordlength
mod ($m_l$) channels whose width is bounded by $w_l = \lceil \log_2(m_l) \rceil$ bits (typically
4 to 8 bits). In practice, most RNS arithmetic systems use small RAM or ROM
tables to implement the modular mappings $z_l = x_l \,\square\, y_l \bmod m_l$.

Example 2.7: RNS Arithmetic 

Consider an RNS system based on the relatively prime moduli set {2, 3, 5}
having a dynamic range of $M = 2 \times 3 \times 5 = 30$. Two integers in $\mathbb{Z}_{30}$, say $7_{10}$
and $4_{10}$, have RNS representations $7 = (1,1,2)_{\text{RNS}}$ and $4 = (0,1,4)_{\text{RNS}}$,
respectively. Their sum, difference, and product are 11, 3, and 28, respectively,
which are all within $\mathbb{Z}_{30}$. Their computation is shown below.

$$\begin{array}{rcl@{\qquad}rcl@{\qquad}rcl}
7 & \overset{(2,3,5)}{\longleftrightarrow} & (1,1,2) & 7 & \overset{(2,3,5)}{\longleftrightarrow} & (1,1,2) & 7 & \overset{(2,3,5)}{\longleftrightarrow} & (1,1,2) \\
+4 & \overset{(2,3,5)}{\longleftrightarrow} & +(0,1,4) & -4 & \overset{(2,3,5)}{\longleftrightarrow} & -(0,1,4) & \times 4 & \overset{(2,3,5)}{\longleftrightarrow} & \times(0,1,4) \\ \hline
11 & \overset{(2,3,5)}{\longleftrightarrow} & (1,2,1) & 3 & \overset{(2,3,5)}{\longleftrightarrow} & (1,0,3) & 28 & \overset{(2,3,5)}{\longleftrightarrow} & (0,1,3).
\end{array}$$

| 2.7 |

RNS systems have been built as custom VLSI devices [36], GaAs, and LSI
[35]. It has been shown that for small wordlengths, the RNS can provide a
significant speed-up using the $2^4 \times 2$-bit tables found in Xilinx XC4000 FPGAs
[37]. For larger moduli, the $2^8 \times 8$-bit tables belonging to the Altera FLEX are
beneficial in designing RNS arithmetic and RNS-to-integer converters. With
the ability to support larger moduli, the design of high-precision high-speed
FPGA systems becomes a practical reality.

A historical barrier to implementing practical RNS systems, until
recently, has been decoding [38]. Implementing an RNS-to-integer decoder,
division, or magnitude scaling requires that data first be converted from an RNS
format to an integer. The commonly referenced RNS-to-integer conversion
methods are called the Chinese remainder theorem (CRT) and the
mixed-radix-conversion (MRC) algorithm [34]. The MRC produces the digits of
a weighted number system representation of an integer, while the CRT maps
an RNS L-tuple directly to an integer. The CRT is defined below.

$$X \bmod M \equiv \sum_{l=1}^{L} \hat{m}_l \left\langle \hat{m}_l^{-1} x_l \right\rangle_{m_l} \bmod M, \qquad (2.13)$$

where $\hat{m}_l = M/m_l$ is an integer, and $\hat{m}_l^{-1}$ is the multiplicative inverse of
$\hat{m}_l \bmod m_l$, i.e., $\hat{m}_l \hat{m}_l^{-1} \equiv 1 \bmod m_l$. Typically, the desired output of an
RNS computation is much less than the maximum dynamic range $M$. In
such cases, a highly efficient algorithm, called the $\varepsilon$-CRT [39], can be used
to implement a time- and area-efficient RNS to (scaled) integer conversion.
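
The channel-wise arithmetic of Example 2.7 and the CRT reconstruction (2.13) can be illustrated with a short Python sketch. The function names are hypothetical, the moduli set {2, 3, 5} is taken from the example, and the modular inverse relies on the three-argument `pow` with a negative exponent (Python 3.8 or later). The plain CRT is shown, not the $\varepsilon$-CRT.

```python
from math import prod
from operator import add, mul, sub

MODULI = (2, 3, 5)                       # pairwise prime, M = 30


def to_rns(x, moduli=MODULI):
    """Map an integer to its residue L-tuple."""
    return tuple(x % m for m in moduli)


def rns_op(op, xs, ys, moduli=MODULI):
    """Apply +, - or * independently in each residue channel."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese remainder theorem reconstruction according to (2.13)."""
    big_m = prod(moduli)
    total = 0
    for x, m in zip(residues, moduli):
        m_hat = big_m // m               # M / m_l
        m_hat_inv = pow(m_hat, -1, m)    # multiplicative inverse mod m_l
        total += m_hat * ((m_hat_inv * x) % m)
    return total % big_m


x7, x4 = to_rns(7), to_rns(4)
for name, op in (("sum", add), ("difference", sub), ("product", mul)):
    z = rns_op(op, x7, x4)
    print(name, z, from_rns(z))
```

Each channel works on a few bits only, and no carry information is exchanged between the channels until the final conversion.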

Index Multiplier 

There are, in fact, several variations of the RNS. One in common use is
based on the use of “index” arithmetic [34]. It is similar in some respects
to logarithmic arithmetic. Computation in the index domain is based on the
fact that if all the moduli are primes, it is known from number theory that
there exists a primitive element, a generator $g$, such that:

$$a \equiv g^{\alpha} \bmod p \qquad (2.14)$$

that generates all elements in the field excluding zero (denoted $\mathbb{Z}_p/\{0\}$).
There is, in fact, a one-to-one correspondence between the integers $a$ in
$\mathbb{Z}_p/\{0\}$ and the exponents $\alpha$ in $\mathbb{Z}_{p-1}$. As a point of terminology, the index
$\alpha$, with respect to the generator $g$ and integer $a$, is denoted $\alpha = \text{ind}_g(a)$.

Example 2.8: Index Coding 

Consider a prime modulus $p = 17$; a generator $g = 3$ will generate the elements
of $\mathbb{Z}_p/\{0\}$. The encoding table is shown below. For notational purposes, the
case $a = 0$ is denoted by $g^{-\infty} = 0$.

a        |  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
ind_3(a) | -∞  0 14  1 12  5 15 11 10  2  3  7 13  4  9  6  8

| 2.8 |



Multiplication of RNS numbers can be performed as follows:

1) Map $a$ and $b$ into the index domain, i.e., $a = g^{\alpha}$ and $b = g^{\beta}$

2) Add the index values modulo $p - 1$, i.e., $\nu = (\alpha + \beta) \bmod (p - 1)$

3) Map the sum back to the original domain, i.e., $n = g^{\nu}$

If the data being processed is in index form, then only exponent addition
mod $(p - 1)$ is required. This can be illustrated by the following example.

Example 2.9: Index Multiplication

Consider the prime modulus $p = 17$, generator $g = 3$, and the results shown
in Example 2.8. The multiplication of $a = 2$ and $b = 4$ proceeds as follows:

$$(\text{ind}_g(2) + \text{ind}_g(4)) \bmod 16 = (14 + 12) \bmod 16 = 10.$$

From the table in Example 2.8 it is seen that $\text{ind}_3(8) = 10$, which corresponds
to the integer 8, which is the expected result. | 2.9 |
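
The three multiplication steps can be sketched in Python for $p = 17$ and $g = 3$. The table and function names below are illustrative assumptions; the index table is generated from the powers of $g$ rather than typed in, so it can be compared against the table of Example 2.8.

```python
P, G = 17, 3

# Antilog table: ANTILOG[k] = g**k mod p for k = 0 .. p-2
ANTILOG = [pow(G, k, P) for k in range(P - 1)]
# Index table: IND[a] = ind_g(a) for a = 1 .. p-1 (a = 0 has no finite index)
IND = {a: k for k, a in enumerate(ANTILOG)}


def index_multiply(a, b):
    """Multiply a and b modulo P using index (discrete logarithm) addition."""
    a, b = a % P, b % P
    if a == 0 or b == 0:                 # zero corresponds to index -infinity
        return 0
    nu = (IND[a] + IND[b]) % (P - 1)     # step 2: add indices mod p-1
    return ANTILOG[nu]                   # step 3: map back


print([IND[a] for a in range(1, P)])     # index table for a = 1 .. 16
print(index_multiply(2, 4))              # product of 2 and 4 mod 17
```

In hardware the two dictionary lookups would be small ROM tables, and the only arithmetic operation is the modulo $p - 1$ addition.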



Addition in the Index Domain 

Most often, DSP algorithms require both multiplication and addition. Index
arithmetic is well suited to multiplication, but addition is no longer trivial.
Technically, addition can be performed by converting index RNS data back
into the RNS where addition is simple to implement. Once the sum is
computed the result is mapped back into the index domain. Another approach
is based on a Zech logarithm. The sum of index-coded numbers $a$ and $b$ is
expressed as:

$$d = a + b = g^{\delta} = g^{\alpha} + g^{\beta} = g^{\alpha}\left(1 + g^{\beta - \alpha}\right) = g^{\beta}\left(1 + g^{\alpha - \beta}\right). \qquad (2.15)$$

If we now define the Zech logarithm as

Definition 2.10: Zech Logarithm

$$Z(n) = \text{ind}_g(1 + g^{n}) \quad \longleftrightarrow \quad g^{Z(n)} = 1 + g^{n} \qquad (2.16)$$

then we can rewrite (2.15) in the following way:

$$g^{\delta} = g^{\beta} \times g^{Z(\alpha - \beta)} \quad \longleftrightarrow \quad \delta = \beta + Z(\alpha - \beta). \qquad (2.17)$$

Adding numbers in the index domain, therefore, requires one addition, one
subtraction, and a Zech LUT. The following small example illustrates the
principle of adding 2 + 5 in the index domain.

Example 2.11: Zech Logarithms

A table of Zech logarithms, for a prime modulus 17 and $g = 3$, is shown below.

n    | -∞  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Z(n) |  0 14 12  3  7  9 15  8 13 -∞  6  2 10  5  4  1 11

The index values for 2 and 5 are defined in the tables found in Example 2.8
(p. 43). It therefore follows that:

$$2 + 5 = 3^{14} + 3^{5} = 3^{5}(1 + 3^{9}) = 3^{5 + Z(9)} = 3^{11} \equiv 7 \bmod 17.$$

| 2.11 |



The case where $a + b \equiv 0$ needs special attention, corresponding to the
case where [40]:

$$-X \equiv Y \bmod p \quad \longleftrightarrow \quad g^{\alpha + (p-1)/2} \equiv g^{\beta} \bmod p.$$

That is, the sum is zero if, in the index domain, $\beta = \alpha + (p-1)/2 \bmod (p-1)$.
An example follows.

Example 2.12: The addition of 5 and 12 in the original domain is given by

$$5 + 12 = 3^{5} + 3^{13} = 3^{5}(1 + 3^{8}) = 3^{5 + Z(8)} = 3^{-\infty} \equiv 0 \bmod 17.$$

| 2.12 |
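
Addition in the index domain according to (2.17), including the zero-sum case, can be sketched as follows. The Python names are hypothetical, and `None` is used here as a stand-in for the index $-\infty$ (the integer zero); a hardware design would reserve a code word for it instead.

```python
P, G = 17, 3
ANTILOG = [pow(G, k, P) for k in range(P - 1)]
IND = {a: k for k, a in enumerate(ANTILOG)}     # ind_g(a) for a = 1 .. p-1


def zech(n):
    """Z(n) = ind_g(1 + g**n); returns None when 1 + g**n is 0 mod p."""
    return IND.get((1 + pow(G, n, P)) % P)


def index_add(alpha, beta):
    """Return delta with g**delta = g**alpha + g**beta (None means zero)."""
    if alpha is None:                    # 0 + b = b
        return beta
    if beta is None:                     # a + 0 = a
        return alpha
    z = zech((alpha - beta) % (P - 1))
    if z is None:                        # a + b = 0 mod p
        return None
    return (beta + z) % (P - 1)


delta = index_add(IND[2], IND[5])
print(delta, ANTILOG[delta])             # index and value of 2 + 5
print(index_add(IND[5], IND[12]))        # 5 + 12 = 17 = 0 mod 17
```

The first call follows Example 2.11 and the second reproduces the special case of Example 2.12.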



Complex Multiplication using QRNS 

Another interesting property of the RNS arises if we process complex data.
This special representation, called QRNS, allows very efficient multiplication,
which we wish to discuss next.

When the real and imaginary components are coded as RNS digits, the
resulting system is called the complex RNS or CRNS. Complex addition in
the CRNS requires that two real adds be performed. Complex RNS (CRNS)
multiplication is defined in terms of four real products, an addition, and a
subtraction. This condition is radically changed when using a variant of the
RNS, called the quadratic RNS, or QRNS. The QRNS is based on known
properties of Gaussian primes of the form $p = 4k + 1$, where $k$ is a positive
integer. The importance of this choice of moduli is found in the factorization
of the polynomial $x^2 + 1$ in $\mathbb{Z}_p$. The polynomial has two roots, $\hat{\jmath}$ and $-\hat{\jmath}$, where
$\hat{\jmath}$ and $-\hat{\jmath}$ are real integers belonging to the residue class $\mathbb{Z}_p$. This is in sharp
contrast with the factoring of $x^2 + 1$ over the complex field. Here, the roots
are complex and have the form $x_{1,2} = \alpha \pm j\beta$ where $j = \sqrt{-1}$ is the imaginary
operator. Converting a CRNS number into the QRNS is accomplished by the
transform $f: \mathbb{Z}_p^2 \to \mathbb{Z}_p^2$ defined as follows:

$$f(a + jb) = \left((a + \hat{\jmath}b) \bmod p,\ (a - \hat{\jmath}b) \bmod p\right) = (A, B). \qquad (2.18)$$

Fig. 2.5. CRNS ↔ QRNS conversion.

In the QRNS, addition and multiplication are realized componentwise, and are
defined as

$$(a + jb) + (c + jd) \leftrightarrow (A + C, B + D) \bmod p \qquad (2.19)$$

$$(a + jb)(c + jd) \leftrightarrow (AC, BD) \bmod p \qquad (2.20)$$

and the square of the absolute value can be computed with

$$|a + jb|^2 \leftrightarrow (A \times B) \bmod p. \qquad (2.21)$$

The inverse mapping from QRNS digits back to the CRNS is defined by:

$$f^{-1}(A, B) = 2^{-1}(A + B) + j(2\hat{\jmath})^{-1}(A - B) \bmod p. \qquad (2.22)$$



Consider the Gaussian prime $p = 13$. The complex product of $(a + jb) =
(2 + j1)$ and $(c + jd) = (3 + j2)$ is $(2 + j1) \times (3 + j2) = (4 + j7) \bmod 13$. In
this case four real multiplies, a real add, and a real subtraction are required to
complete the product.

Example 2.13: QRNS Multiplication

The quadratic equation $x^2 \equiv (-1) \bmod 13$ has two roots: $\hat{\jmath} = 5$ and
$-\hat{\jmath} = -5 \equiv 8 \bmod 13$. The QRNS coded data become:

$$(a + jb) = 2 + j \leftrightarrow (2 + 5 \times 1,\ 2 + 8 \times 1) = (A, B) = (7, 10) \bmod 13$$
$$(c + jd) = 3 + j2 \leftrightarrow (3 + 5 \times 2,\ 3 + 8 \times 2) = (C, D) = (0, 6) \bmod 13.$$

Componentwise multiplication yields $(A, B)(C, D) = (7, 10)(0, 6) \equiv (0, 8)
\bmod 13$, requiring only two real multiplies. The inverse mapping to the CRNS
is defined in terms of (2.22), where $2^{-1} \equiv 7$ and $(2\hat{\jmath})^{-1} = 10^{-1} \equiv 4$. Solving
the equations $2x \equiv 1 \bmod 13$ and $10x \equiv 1 \bmod 13$ produces 7 and 4,
respectively. It then follows that

$$f^{-1}(0, 8) = 7(0 + 8) + j\,4(0 - 8) \bmod 13 \equiv 4 + j7 \bmod 13.$$

| 2.13 |

Figure 2.5 shows a graphical interpretation of the mapping between CRNS
and QRNS.
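
The mappings (2.18) and (2.22) and the componentwise product (2.20) can be checked with a few lines of Python. This is a sketch for the single modulus $p = 13$ with root $\hat{\jmath} = 5$ from Example 2.13; the function names are assumptions of this example, and the modular inverses use `pow` with a negative exponent (Python 3.8 or later).

```python
P = 13
J_HAT = 5                                # 5 * 5 = 25 = -1 mod 13
assert (J_HAT * J_HAT) % P == P - 1


def to_qrns(a, b):
    """Forward map (2.18): a + jb -> (A, B)."""
    return ((a + J_HAT * b) % P, (a - J_HAT * b) % P)


def qrns_multiply(x, y):
    """Componentwise product (2.20): two real multiplies."""
    return ((x[0] * y[0]) % P, (x[1] * y[1]) % P)


def from_qrns(big_a, big_b):
    """Inverse map (2.22): (A, B) -> (real part, imaginary part)."""
    inv_2 = pow(2, -1, P)                # 7 for p = 13
    inv_2j = pow(2 * J_HAT, -1, P)       # 4 for p = 13
    return ((inv_2 * (big_a + big_b)) % P, (inv_2j * (big_a - big_b)) % P)


x = to_qrns(2, 1)
y = to_qrns(3, 2)
z = qrns_multiply(x, y)
print(x, y, z, from_qrns(*z))
```

The result maps back to real part 4 and imaginary part 7, in agreement with the direct complex product modulo 13.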



2.2.3 Floating-Point Numbers 



Floating-point systems were developed to provide high resolution over a large
dynamic range. Floating-point systems often can provide a solution when
fixed-point systems, with their limited dynamic range, fail. Floating-point
systems, however, bring a speed and complexity penalty. Most
microprocessor floating-point systems comply with the published single- or
double-precision IEEE floating-point standard [41, 42], while FPGA-based systems
often employ custom formats. We will therefore discuss standard and custom
floating-point formats in the following, and in Sec. 2.6 (p. 76) the design
of basic building blocks. Such arithmetic blocks are available from several
“intellectual property” providers, or through special request via e-mail to
Uwe.Meyer-Baese@ieee.org.

A standard floating-point word consists of a sign-bit $s$, exponent $e$, and
an unsigned (fractional) normalized mantissa $m$, arranged as follows:

| $s$ | Exponent $e$ | Unsigned mantissa $m$ |

Algebraically, a floating-point word is represented by:

$$X = (-1)^{s} \times 1.m \times 2^{e - \text{bias}}. \qquad (2.23)$$

Note that this is a signed magnitude format (see p. 35). The “hidden” one in
the mantissa is not present in the binary coding of the floating-point number.
If the exponent is represented with $E$ bits then the bias is selected to be

$$\text{bias} = 2^{E-1} - 1. \qquad (2.24)$$

To illustrate, let us determine the decimal value 9.25 in a 12-bit custom 
floating-point format. 

Example 2.14: A (1,6,5) Floating-point Format 

Consider a floating-point representation with a sign bit, $E = 6$-bit exponent
width, and $M = 5$ bits for the mantissa (not counting the hidden one). Let
us now determine the representation of $9.25_{10}$ in this (1,6,5) floating-point
format. Using (2.24) the bias is

$$\text{bias} = 2^{E-1} - 1 = 31,$$

and the mantissa needs to be normalized according to the $1.m$ format, i.e.,

$$9.25_{10} = 1001.01_2 = 1.\underbrace{00101}_{m} \times 2^{3}.$$

The biased exponent is therefore represented with

$$e = 3 + \text{bias} = 34_{10} = 100010_2.$$

Finally, we can represent $9.25_{10}$ in the (1,6,5) floating-point format with

| $s$ | Exponent $e$ | Unsigned mantissa $m$ |
| 0   | 100010       | 00101                 |

Besides this fixed-point to floating-point conversion we also need the back
conversion from floating-point to integer. So, let us assume the following
floating-point number

| $s$ | Exponent $e$ | Unsigned mantissa $m$ |
| 1   | 011111       | 00000                 |

is given and we wish to find the fixed-point representation of this number.
We first notice that the sign bit is one, i.e., it is a negative number. Adding
the hidden one to the mantissa and subtracting the bias from the exponent
yields

$$-1.00000_2 \times 2^{31 - \text{bias}} = -1.0_2 \times 2^{0} = -1.0_{10}.$$

We note that in the floating-point to fixed-point conversion the bias is
subtracted from the exponent, while in the fixed-point to floating-point
conversion the bias is added to the exponent. | 2.14 |
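
Both conversion directions of Example 2.14 can be sketched in Python. The sketch is a simplified illustration of the (1,6,5) format: it truncates the mantissa, handles normalized nonzero values only (no zeros, infinities, denormals, or NaNs), and the function names are assumptions of this example.

```python
import math

E_BITS, M_BITS = 6, 5
BIAS = 2 ** (E_BITS - 1) - 1             # 31, from (2.24)


def fp_encode(x):
    """Return the fields (s, e, m) of a nonzero x in the (1,6,5) format."""
    if x == 0:
        raise ValueError("zero is a special code and not handled here")
    s = 1 if x < 0 else 0
    frac, exp2 = math.frexp(abs(x))      # abs(x) = frac * 2**exp2, 0.5 <= frac < 1
    e = (exp2 - 1) + BIAS                # exponent of the 1.m form plus bias
    if not 1 <= e <= 2 ** E_BITS - 2:
        raise OverflowError("exponent outside the normalized range")
    m = int((2.0 * frac - 1.0) * 2 ** M_BITS)   # drop hidden one, truncate
    return s, e, m


def fp_decode(s, e, m):
    """Evaluate (2.23) for the fields of a normalized number."""
    return (-1.0) ** s * (1.0 + m / 2 ** M_BITS) * 2.0 ** (e - BIAS)


s, e, m = fp_encode(9.25)
print(s, format(e, "06b"), format(m, "05b"))
print(fp_decode(1, 0b011111, 0b00000))
```

The bias is added in `fp_encode` and subtracted in `fp_decode`, mirroring the remark at the end of the example.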



The IEEE standard 754-1985 for binary floating-point arithmetic [41] also
defines some additional useful special numbers to handle, for instance,
overflow and underflow. The exponent $e = E_{\max} = 1\ldots1_2$ in combination with
zero mantissa $m = 0$ is reserved for $\infty$. Zeros are coded with zero exponent
$e = E_{\min} = 0$ and zero mantissa $m = 0$. Note that, due to the signed
magnitude representation, plus and minus zero are coded differently. There are
two more special numbers defined in the 754 standard, but these additional
representations are most often not supported in FPGA floating-point
arithmetic. These additional numbers are denormals and NaNs (not a number).
With denormalized numbers we can represent numbers smaller than $2^{E_{\min}}$
by allowing the mantissa to represent numbers without the hidden one, i.e.,
the mantissa can represent numbers smaller than 1.0. The exponent in
denormals is coded with $e = E_{\min} = 0$, but the mantissa is allowed to be different
from zero. NaNs have proven useful in software systems to reduce the
number of “exceptions” that are called when an invalid operation is performed.
Examples that produce such “quiet” NaNs include:

• Addition or subtraction of two infinities, such as $\infty - \infty$

• Multiplication of zero and infinity, e.g., $0 \times \infty$

• Division of zeros or infinities, e.g., $0/0$ or $\infty/\infty$

• Square root of a negative operand

In the IEEE standard 754-1985 for binary floating-point arithmetic NaNs
are coded with exponent $e = E_{\max} = 1\ldots1_2$ in combination with a nonzero
mantissa $m \neq 0$.

We now wish to compare the fixed-point and floating-point representations
in terms of precision and dynamic range in the following example.

Example 2.15: 12-Bit Floating- and Fixed-point Representations 

Suppose we use again a (1,6,5) floating-point format as in the previous
example. The (absolute) largest number we can represent is:

$$\pm 1.11111_2 \times 2^{31} \approx \pm 4.23_{10} \times 10^{9}.$$

The (absolutely measured) smallest number (not including denormals) that
can be represented is

$$\pm 1.0_2 \times 2^{1 - \text{bias}} = \pm 1.0_2 \times 2^{-30} \approx \pm 9.31_{10} \times 10^{-10}.$$

Note that $E_{\max} = 1\ldots1_2$ and $E_{\min} = 0$ are reserved for infinity and zero,
respectively, in the floating-point format, and must not be used for general
number representations. Table 2.4 shows some example codings for the (1,6,5)
floating-point format including the special numbers.

Table 2.4. Example values in (1,6,5) floating-point format.

(1,6,5) format  | Decimal   | Coding
----------------+-----------+----------------
0 000000 00000  | $+0$      | $2^{E_{\min}}$
1 000000 00000  | $-0$      | $-2^{E_{\min}}$
0 011111 00000  | $+1.0$    | $2^{\text{bias}}$
1 011111 00000  | $-1.0$    | $-2^{\text{bias}}$
0 111111 00000  | $+\infty$ | $2^{E_{\max}}$
1 111111 00000  | $-\infty$ | $-2^{E_{\max}}$

For the 12-bit fixed-point format we use one sign bit, 5 integer bits, and 6
fractional bits. The maximum (absolute) values we can represent with this
12-bit fixed-point format are therefore:

$$\pm 11111.111111_2 = \pm\left(16 + 8 + \cdots + \tfrac{1}{32} + \tfrac{1}{64}\right)_{10} = \pm\left(32 - \tfrac{1}{64}\right)_{10} \approx \pm 32.0_{10}.$$

The (absolutely measured) smallest number that this 12-bit fixed-point
format represents is

$$\pm 00000.000001_2 = \pm\left(\tfrac{1}{64}\right)_{10} = \pm 0.015625_{10}.$$

| 2.15 |



From this example we notice the larger dynamic range of the floating-point
representation ($4 \times 10^{9}$ compared with 32) but also a higher precision of the
fixed-point representation. For instance, 1.0 and 1 + 1/64 = 1.015625 are coded
the same in (1,6,5) floating-point format, but can be distinguished in 12-bit
fixed-point representation.
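
The limits quoted in Example 2.15 follow directly from the field widths, as the following Python sketch shows. The variable names are illustrative, the largest and smallest biased exponents are excluded because they are reserved for the special numbers, and the mantissa is truncated, which is an assumption of this sketch.

```python
E_BITS, M_BITS = 6, 5                     # (1,6,5) floating-point format
BIAS = 2 ** (E_BITS - 1) - 1

# Largest normalized value: mantissa 1.11111_2, biased exponent E_max - 1
fp_largest = (2.0 - 2.0 ** -M_BITS) * 2.0 ** ((2 ** E_BITS - 2) - BIAS)
# Smallest normalized value: mantissa 1.0_2, biased exponent 1
fp_smallest = 2.0 ** (1 - BIAS)

INT_BITS, FRAC_BITS = 5, 6                # 12-bit fixed point with sign bit
fx_largest = 2.0 ** INT_BITS - 2.0 ** -FRAC_BITS
fx_smallest = 2.0 ** -FRAC_BITS


def fp_mantissa_near_one(x):
    """Mantissa field of a value in [1, 2) after truncation to M_BITS."""
    return int((x - 1.0) * 2 ** M_BITS)


print(fp_largest, fp_smallest)
print(fx_largest, fx_smallest)
print(fp_mantissa_near_one(1.0), fp_mantissa_near_one(1.0 + 1.0 / 64))
```

The last line shows that 1.0 and 1 + 1/64 share the same 5-bit mantissa field, whereas they differ by one LSB in the fixed-point format.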

Although the IEEE standard 754-1985 for binary floating-point
arithmetic [41] is not easy to implement with all its details such as four different
rounding modes, denormals, or NaNs, the early introduction of the standard
in 1985 helped it become the most widely adopted implementation for
microprocessors. The parameters of the IEEE single and double formats can
be seen in Table 2.5. Because

• even single-precision 754 standard arithmetic designs require a 24 x 24 bit
multiplier, and

• FPGAs allow a more specific dynamic range design (i.e., exponent bit
width) and precision (mantissa bit width) design

we find that FPGA designs usually do not adopt the 754 standard and define
a special format. Shirazi et al. [43], for instance, have developed a modified
format to implement various algorithms on their custom computing machine
called SPLASH-2, a multiple-FPGA board based on Xilinx XC4010 devices.
They used an 18-bit format so that they can transport two operands over the
36-bit wide system bus of the multiple-FPGA board. The 18-bit format has
a 10-bit mantissa, 7-bit exponent and a sign bit, and can represent a range
of $3.7 \times 10^{19}$.

Table 2.5. IEEE floating-point standard.

            | Single                                  | Double
------------+-----------------------------------------+------------------------------------------
Word length | 32                                      | 64
Mantissa    | 23                                      | 52
Exponent    | 8                                       | 11
Bias        | 127                                     | 1023
Range       | $2^{128} \approx 3.8 \times 10^{38}$    | $2^{1024} \approx 1.8 \times 10^{308}$



2.3 Binary Adders 

A basic binary N-bit adder/subtractor consists of N full-adders (FA). A
full-adder implements the following Boolean equations

$$s_k = x_k \text{ XOR } y_k \text{ XOR } c_k \qquad (2.25)$$

$$\phantom{s_k} = x_k \oplus y_k \oplus c_k \qquad (2.26)$$

that define the sum-bit. The carry (out) bit is computed with:

$$c_{k+1} = (x_k \text{ AND } y_k) \text{ OR } (x_k \text{ AND } c_k) \text{ OR } (y_k \text{ AND } c_k) \qquad (2.27)$$

$$\phantom{c_{k+1}} = (x_k \times y_k) + (x_k \times c_k) + (y_k \times c_k) \qquad (2.28)$$

In the case of a 2C adder, the LSB can be reduced to a half-adder because
the carry input is zero.
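
The full-adder equations (2.25) and (2.27) can be chained into an N-bit adder in a few lines. The following Python model is a bit-level illustration only, not a hardware description; the function names are hypothetical.

```python
def full_adder(x, y, c):
    """One-bit full adder: sum bit per (2.25), carry-out per (2.27)."""
    s = x ^ y ^ c
    c_out = (x & y) | (x & c) | (y & c)
    return s, c_out


def ripple_carry_add(a, b, n):
    """Add two n-bit unsigned operands bit by bit; return (sum, carry_out)."""
    carry = 0                            # carry input of the LSB is zero
    result = 0
    for k in range(n):
        s, carry = full_adder((a >> k) & 1, (b >> k) & 1, carry)
        result |= s << k
    return result, carry


print(ripple_carry_add(5, 6, 4))         # 4-bit sum and carry-out of 5 + 6
print(ripple_carry_add(15, 1, 4))        # overflow produces a carry-out
```

The loop makes the serial dependency explicit: bit $k$ cannot be finished before the carry from bit $k-1$ is known.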

The simplest adder structure is called the “ripple carry adder” as shown
in Fig. 2.6a in a bit-serial form. If larger tables are available in the FPGA,
several bits can be grouped together into one LUT, as shown in Fig. 2.6b.
For this “two bit at a time” adder the longest delay comes from the ripple of
the carry through all stages. Attempts have been made to reduce the carry
delays using techniques such as the carry-skip, carry lookahead, conditional
sum, or carry-select adders. These techniques can speed up addition and can
be used with older-generation FPGA families (e.g., XC3000 from Xilinx)
since these devices do not provide internal fast carry logic. Modern families,
such as the Xilinx XC4K or Altera FLEX, possess very fast “ripple carry
logic” that is about an order of magnitude faster than the delay through a
regular logic LUT [1]. Altera uses fast tables (see Fig. 1.12, p. 18), while the Xilinx
XC4K uses hardwired decoders for implementing carry logic based on the
multiplexer structure shown in Fig. 2.7. The presence of the fast-carry logic
in modern FPGA families removes the need to develop hardware-intensive
carry look-ahead schemes.

Fig. 2.6. Two's complement adders. (Panel (b): 2-bit adders, each a $2^5 \times 3$ LUT.)

Fig. 2.7. XC4000 fast-carry logic (©1993 Xilinx).

Figure 2.8 summarizes the size and Registered Performance of N-bit
binary adders, if implemented with the lpm_add_sub megafunction
component. If the operands are applied through I/O cells, the delays through the
busses of a FLEX device are dominant and performance decreases if the
registers of the I/O cells are used (Option: Assign → Global Project Logic
Synthesis → Automatic Fast I/O). If the data are routed from local
registers, performance improves. This type of additional LC register allocation
will appear (in the project report file) as increased LC use by a factor of
three. A synchronous registered design would not consume any additional
resources. A typical design will achieve a speed between these two cases.



2.3.1 Pipelined Adders 

Pipelining is extensively used in DSP solutions due to the intrinsic dataflow
regularity of DSP algorithms. Programmable digital signal processor MACs
[6, 14, 15] typically carry at least four pipelined stages. The processor:

1) Decodes the command

2) Loads the operands in registers

3) Performs multiplication and stores the product, and

4) Accumulates the products, all concurrently.

Table 2.6. Performance of a 5-bit pipelined adder using synthesis of predefined
LPM modules with pipeline option.

Pipeline | Use I/O register | No use of I/O register
stages   | MHz      LCs     | MHz       LCs
---------+------------------+-----------------------
0        | 67.11    6       | 108.69    15
1        | 65.78    10      | 101.01    21
2        | 65.78    16      | 90.90     28

The pipelining principle can be applied to FPGA designs as well, at little
or no additional cost since each logic element contains a flip-flop, which is
otherwise unused, to save routing resources. With pipelining it is possible
to break an arithmetic operation into small primitive operations, save the
carry and the intermediate values in registers, and continue the calculation
in the next clock cycle. Such adders are sometimes called carry save adders¹
(CSAs) in the literature. Then the question arises: Into how many pieces
should we divide the adder? Should we pipeline at the bit level? For Altera’s
FLEX 10K devices a reasonable choice is always to use one LAB with 8 LCs and
8 FFs for one pipeline element. In fact, it can be shown that if we try to
pipeline (for instance) a 5-bit adder, the performance drops, as reported in
Table 2.6, because the pipelined 5-bit adder does not fit in one LAB.

Because the number of flip-flops in one LAB is 8 and we need an extra 
flip-flop for the carry-out, we should use a maximum block size of 7 bits for 
maximum Registered Performance. Only the blocks with the MSBs can 
be 8 bits wide, because we do not need the extra flip-flop for the carry. This 
observation leads to the following conclusions: 

1) With one additional pipeline stage we can build adders up to a length
7 + 8 = 15.

2) With two pipeline stages we can build adders with up to 7 + 7 + 8 = 22-bit
length.

3) With three pipeline stages we can build adders with up to 7 + 7 + 7 + 8 =
29-bit length.
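The block-size rule above can be checked with a few lines of Python. This is only a sketch: the function names and the lab_ffs parameter are hypothetical, and the model assumes exactly the rule stated in the text (every block except the MSB block gives up one of the 8 LAB flip-flops to the carry).

```python
def max_adder_width(stages, lab_ffs=8):
    """Widest adder that `stages` additional pipeline stages allow.

    Each LSB-side block is limited to lab_ffs - 1 bits because one
    flip-flop of the LAB stores the carry-out; the MSB block may use
    all lab_ffs flip-flops.
    """
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return stages * (lab_ffs - 1) + lab_ffs


def stages_needed(width, lab_ffs=8):
    """Smallest number of additional pipeline stages for a given bit width."""
    stages = 0
    while max_adder_width(stages, lab_ffs) < width:
        stages += 1
    return stages
```

For example, max_adder_width(2) evaluates 7 + 7 + 8, and stages_needed(16) reports that a 16-bit adder already needs the second pipeline stage.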

Table 2.7 shows the Registered Performance and LC utilization of this
kind of pipelined adder. From Table 2.7 it can be concluded that although
the bit width increases, the Registered Performance remains almost the
same if we add the appropriate number of pipeline stages. It can also be
seen from Table 2.7 that the new Quartus fitter does not always improve the
Registered Performance of the design, especially if the I/O registers are not
used. It is also interesting to note that the normal fitter cannot place the
26 LC design add_1p.vhd, while the larger designs can be fitted in the device.

¹ The name carry save adder is also used in the context of a Wallace multiplier,
see Exercise 2.1, p. 103.

Table 2.7. Performance of pipelined adders. Size and speed are for the maximum
bit width, for 15-, 22-, and 29-bit adders. Two Registered Performance values
are shown: with/without Quartus fitter.

Bit      Use I/O register       No use of I/O register   Pipeline   Design
width    MHz            LCs     MHz            LCs       stages     file name

9-15     63.29/no fit   26      78.74/87.71    61        1          add_1p.vhd
16-22    63.29/62.50    58      71.94/84.03    113       2          add_2p.vhd
23-29    60.97/63.69    105     74.07/81.30    180       3          add_3p.vhd

Fig. 2.9. Pipelined adder.

The following example shows the code of a 15-bit pipelined adder that is
graphically interpreted by Fig. 2.9.

Example 2.16: VHDL Design of 15-bit Pipelined Adder

Consider the VHDL code² of a 15-bit pipelined adder that is graphically
interpreted in Fig. 2.9. Depending on the synthesis style, the design runs at
63 to 87 MHz.

LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY add_1p IS
  GENERIC (WIDTH  : INTEGER := 15;  -- Total bit width
           WIDTH1 : INTEGER := 7;   -- Bit width of LSBs
           WIDTH2 : INTEGER := 8;   -- Bit width of MSBs
           ONE    : INTEGER := 1);  -- 1 bit for carry reg
  PORT (x, y : IN  STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);  -- Inputs
        sum  : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);  -- Result
        clk  : IN  STD_LOGIC);
END add_1p;

ARCHITECTURE flex OF add_1p IS
  SIGNAL l1, l2, r1, q1             -- LSBs of inputs
         : STD_LOGIC_VECTOR(WIDTH1-1 DOWNTO 0);
  SIGNAL l3, l4, r2, q2, u2, h2     -- MSBs of inputs
         : STD_LOGIC_VECTOR(WIDTH2-1 DOWNTO 0);
  SIGNAL s                          -- Output register
         : STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
  SIGNAL cr1, cq1                   -- LSBs carry signal
         : STD_LOGIC_VECTOR(ONE-1 DOWNTO 0);

BEGIN

  PROCESS  -- Split in MSBs and LSBs and store in registers
  BEGIN
    WAIT UNTIL clk = '1';
    -- Split LSBs from input x,y
    FOR k IN WIDTH1-1 DOWNTO 0 LOOP
      l1(k) <= x(k);
      l2(k) <= y(k);
    END LOOP;
    -- Split MSBs from input x,y
    FOR k IN WIDTH2-1 DOWNTO 0 LOOP
      l3(k) <= x(k+WIDTH1);
      l4(k) <= y(k+WIDTH1);
    END LOOP;
  END PROCESS;

  -------------- First stage of the adder --------------
  add_1: lpm_add_sub              -- Add LSBs of x and y
    GENERIC MAP ( LPM_WIDTH => WIDTH1,
                  LPM_REPRESENTATION => "UNSIGNED",
                  LPM_DIRECTION => "ADD")
    PORT MAP ( dataa => l1, datab => l2,
               result => r1, cout => cr1(0));
  reg_1: lpm_ff                   -- Save LSBs of x+y and carry
    GENERIC MAP ( LPM_WIDTH => WIDTH1 )
    PORT MAP ( data => r1, q => q1, clock => clk );
  reg_2: lpm_ff
    GENERIC MAP ( LPM_WIDTH => ONE )
    PORT MAP ( data => cr1, q => cq1, clock => clk );

  add_2: lpm_add_sub              -- Add MSBs of x and y
    GENERIC MAP ( LPM_WIDTH => WIDTH2,
                  LPM_REPRESENTATION => "UNSIGNED",
                  LPM_DIRECTION => "ADD")
    PORT MAP ( dataa => l3, datab => l4, result => r2 );
  reg_3: lpm_ff                   -- Save MSBs of x+y
    GENERIC MAP ( LPM_WIDTH => WIDTH2 )
    PORT MAP ( data => r2, q => q2, clock => clk );

  -------------- Second stage of the adder -------------
  -- One operand is zero
  h2 <= (OTHERS => '0');

  -- Add result from MSBs (x+y) and carry from LSBs
  add_3: lpm_add_sub
    GENERIC MAP ( LPM_WIDTH => WIDTH2,
                  LPM_REPRESENTATION => "UNSIGNED",
                  LPM_DIRECTION => "ADD")
    PORT MAP ( cin => cq1(0), dataa => q2,
               datab => h2, result => u2 );

  PROCESS    -- Build a single registered output
  BEGIN      -- word of WIDTH=WIDTH1+WIDTH2
    WAIT UNTIL clk = '1';
    FOR k IN WIDTH1-1 DOWNTO 0 LOOP
      s(k) <= q1(k);
    END LOOP;
    FOR k IN WIDTH2-1 DOWNTO 0 LOOP
      s(k+WIDTH1) <= u2(k);
    END LOOP;
  END PROCESS;

  sum <= s;  -- Connect s to output pins

END flex;

² The equivalent Verilog code add_1p.v for this example can be found in
Appendix A on page 437.

The simulated performance of the 15-bit pipelined adder is shown in Fig. 2.10. 
Note that the addition of 140 and 130 produces a carry from the lower 7-bit 
adder, but there is no carry for 120 + 5 = 125 < 127. | 2.16 | 
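To make the register timing of the listing easier to follow, here is a small cycle-based Python model. It is a sketch, not the book's code: the class name and the clock() method are hypothetical, and the model assumes the structure of the listing (input registers, a 7-bit LSB adder with a registered carry, an 8-bit MSB adder, and a registered output word).

```python
class PipelinedAdder15:
    """Cycle-based model: 7-bit LSB adder, 8-bit MSB adder, one carry register."""

    W1, W2 = 7, 8  # LSB and MSB block widths (WIDTH1, WIDTH2 in the listing)

    def __init__(self):
        self.l1 = self.l2 = self.l3 = self.l4 = 0  # input registers
        self.q1 = self.cq1 = self.q2 = 0           # first-stage registers
        self.s = 0                                 # output register

    def clock(self, x, y):
        """Apply one rising clock edge with inputs x, y; return the new sum."""
        m1, m2 = (1 << self.W1) - 1, (1 << self.W2) - 1
        # Combinational results computed from the *current* register contents.
        r1 = self.l1 + self.l2              # LSB sum including carry-out
        r2 = (self.l3 + self.l4) & m2       # MSB sum, carry-out discarded
        u2 = (self.q2 + self.cq1) & m2      # MSBs plus registered LSB carry
        # All registers update on the same edge.
        self.s = (u2 << self.W1) | self.q1
        self.q1, self.cq1, self.q2 = r1 & m1, r1 >> self.W1, r2
        self.l1, self.l2 = x & m1, y & m1
        self.l3, self.l4 = (x >> self.W1) & m2, (y >> self.W1) & m2
        return self.s
```

Feeding one operand pair per clock shows the pipelining effect: a new sum can be accepted every cycle, and each result appears on the third clock edge after its operands were applied.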
2.3.2 Modulo Adders

Modulo adders are the most important building blocks in RNS-DSP designs. 
They are used for both additions and, via index arithmetic, for multiplica- 
tions. We wish to describe some design options for FPGAs in the following 
discussion. 

A wide variety of modular addition designs exists [44]. Using LEs only, the
design of Fig. 2.11a is viable for FPGAs. The Altera FLEX devices contain
a small number of 2-kbit ROMs or RAMs (EABs) that can be configured
as 2^8 × 8, 2^9 × 4, 2^10 × 2, or 2^11 × 1 tables and can be used for modulo m_l
correction. The next table shows size and Registered Performance of 6-, 7-,
and 8-bit modulo adders [45].

         Pipeline    Bits
         stages      6             7             8

MPX      0           41.3 MSPS     46.5 MSPS     33.7 MSPS
                     27 LE         31 LE         35 LE

MPX      2           76.3 MSPS     62.5 MSPS     60.9 MSPS
                     16 LE         18 LE         20 LE

MPX      3           151.5 MSPS    138.9 MSPS    123.5 MSPS
                     27 LE         31 LE         35 LE

ROM      3           86.2 MSPS     86.2 MSPS     86.2 MSPS
                     7 LE          8 LE          9 LE
                     1 EAB         1 EAB         2 EAB

Although the ROM shown in Fig. 2.11 provides high speed, the ROM
itself produces a four-cycle pipeline delay and the number of ROMs is limited.
ROMs, however, are mandatory for the scaling schemes discussed before. The
multiplexed-adder (MPX-Add) has a comparatively reduced speed even if a
carry chain is added to each column. The pipelined version usually needs the
same number of LEs as the unpipelined version but runs about twice as fast.
Maximum throughput occurs when the adders are implemented in two blocks
within 6-bit pipelined channels.

Fig. 2.11. Modular additions. (a) MPX-Add and MPX-Add-Pipe. (b) ROM-Pipe.
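The multiplexed modulo adder can be sketched in Python as follows. This assumes that MPX-Add denotes the common structure with two adders and a multiplexer (one adder forms x + y, the other x + y - m, and the sign of the second result selects the output); the function name is hypothetical and the figure itself is not reproduced here.

```python
def mod_add_mpx(x, y, m):
    """Modulo-m addition with two adders and a multiplexer (assumed MPX-Add).

    Both inputs must already be reduced, i.e., 0 <= x, y < m.
    """
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("operands must be in the range 0..m-1")
    s = x + y            # first adder: plain sum
    s_corr = s - m       # second adder: sum with modulus correction
    # Multiplexer: take the corrected sum unless it went negative.
    return s_corr if s_corr >= 0 else s
```

Because the inputs are already reduced, x + y is always below 2m, so a single conditional subtraction is sufficient and no division is required.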



2.4 Binary Multipliers 

The product of two N-bit binary numbers, say X and
$A = \sum_{k=0}^{N-1} a_k 2^k$, is given by the “pencil and paper” method as:

$$P = A \times X = \sum_{k=0}^{N-1} a_k 2^k X. \qquad (2.29)$$

It can be seen that the input X is successively shifted by k positions and
whenever $a_k \neq 0$, then $X 2^k$ is accumulated. If $a_k = 0$, then the
corresponding shift-add can be ignored (i.e., nop). The following VHDL example
uses this “pencil and paper” scheme to multiply two 8-bit integers.

Example 2.17: 8-bit Multiplier 

The VHDL description³ of an 8-bit multiplier is developed below. Multiplication
is performed in three stages. First, the 8-bit operands are “loaded” and
the product register reset. In the second stage, s1, the actual serial-parallel
multiplication takes place. In the third step, s2, the product is transferred to
the output register y.

PACKAGE eight_bit_int IS    -- User defined types
  SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
  SUBTYPE TWOBYTES IS INTEGER RANGE -32768 TO 32767;
END eight_bit_int;

LIBRARY work;
USE work.eight_bit_int.ALL;

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY mul_ser IS                         ------> Interface
  PORT ( clk : IN  STD_LOGIC;
         x   : IN  BYTE;
         a   : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
         y   : OUT TWOBYTES);
END mul_ser;

ARCHITECTURE flex OF mul_ser IS
  TYPE STATE_TYPE IS (s0, s1, s2);
  SIGNAL state : STATE_TYPE;
BEGIN
  States: PROCESS           ------> Multiplier in behavioral style
    VARIABLE p, t : TWOBYTES;             -- Double bit width
    VARIABLE count : INTEGER RANGE 0 TO 7;
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN s0 =>            -- Initialization step
        state <= s1;
        count := 0;
        p := 0;             -- Product register reset
        t := x;             -- Set temporary shift register to x
      WHEN s1 =>            -- Processing step
        IF count = 7 THEN   -- Multiplication ready
          state <= s2;
        ELSE
          IF a(count) = '1' THEN
            p := p + t;     -- Add 2^k
          END IF;
          t := t * 2;
          count := count + 1;
          state <= s1;
        END IF;
      WHEN s2 =>            -- Output of result to y and
        y <= p;             -- start next multiplication
        state <= s0;
    END CASE;
  END PROCESS States;

END flex;

³ The equivalent Verilog code mul_ser.v for this example can be found in
Appendix A on page 445.







2. Computer Arithmetic 




Figure 2.12 shows the simulation result of a multiplication of 13 and 5.
The register t shows the partial product sequence of 5, 10, 20, .... Since
$13_{10} = 00001101_{2C}$, the product register p is updated only three times in
the production of the final result, 65. In state s2 the result 65 is transferred
to the output y of the multiplier. The design uses 115 LCs and runs with a
Registered Performance of 41.15 MHz. | 2.17 |
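The shift-add scheme of (2.29) can also be traced with a short Python sketch. The function name and the returned addition count are illustrative assumptions; the model treats a as an unsigned word of n_bits bits and visits every bit position.

```python
def mul_shift_add(x, a, n_bits=8):
    """Serial/parallel multiply: accumulate x * 2**k for every set bit a_k.

    Returns the product and the number of times the product register
    was actually updated.
    """
    if not 0 <= a < (1 << n_bits):
        raise ValueError("a must be an unsigned n_bits-wide word")
    p, t, updates = 0, x, 0
    for k in range(n_bits):
        if (a >> k) & 1:   # a_k = 1: add the shifted operand
            p += t
            updates += 1
        t *= 2             # shift x by one more position
    return p, updates
```

For x = 5 and a = 13 the temporary register runs through 5, 10, 20, ... and the product is updated for the three set bits of 13.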



Because one operand is used in parallel (i.e., X) and the second operand
A is used bitwise, the multipliers we just described are called serial/parallel
multipliers. If both operands are used serially, the scheme is called a
serial/serial multiplier [46]. Such a multiplier only needs one full adder, but
the latency of serial/serial multipliers is high, $O(N^2)$, because the state
machine needs about $N^2$ cycles.

Another approach, which gains speed at the cost of increased complexity, is
called an “array,” or parallel/parallel multiplier. A 4 × 4-bit array
multiplier is shown in Fig. 2.13. Notice that both operands are presented in
parallel to an adder array of $N^2$ adder cells.

This arrangement is viable if the times required to complete the carry
and sum calculations are the same. For a modern FPGA, however, the carry
computation is performed faster than the sum calculation and a different
architecture is more efficient for FPGAs. The approach for this array multiplier
is shown in Fig. 2.14, for an 8 × 8-bit multiplier. This scheme combines in
the first stage two neighboring partial products $a_n X 2^n$ and
$a_{n+1} X 2^{n+1}$ and the results are added to arrive at the final output
product. This is a direct array form of the “pencil and paper” method and must
therefore produce a valid product.

We recognize from Fig. 2.14 that this type of array multiplier gives the
opportunity to realize a (parallel) binary tree of the multiplier with a total:

number of stages in the binary tree multiplier = $\log_2(N)$. (2.30)

This alternative architecture also makes it easier to introduce pipeline stages
after each tree level. The necessary number of pipeline stages, according to
(2.30), to achieve maximum throughput is:

Bit width                            2    3-4    5-8    9-16    17-32

Optimal number of pipeline stages    1    2      3      4       5
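The stage counts listed above follow from rounding $\log_2(N)$ up to the next integer. A minimal Python sketch (the function name is hypothetical) reproduces the table using integer arithmetic only:

```python
def tree_pipeline_stages(n_bits):
    """Number of binary-tree levels for an n_bits x n_bits multiplier.

    Equals ceil(log2(n_bits)), computed without floating point:
    (n - 1).bit_length() is the number of bits needed to write n - 1.
    """
    if n_bits < 2:
        raise ValueError("bit width must be at least 2")
    return (n_bits - 1).bit_length()
```

The bit_length form avoids the rounding problems that a floating-point logarithm can show at exact powers of two.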

Figure 2.15 reports the Registered Performance of three pipelined N ×
N-bit multipliers, using the MaxPlusII lpm_mult function, in a range from
4 × 4 to 16 × 16 bits without (dotted line) and with (solid line) the use of
I/O cell registers. Figure 2.16 shows the effort for the multiplier, with (solid
line) and without (dotted line) using the I/O cell registers. Placing the input
register close to the multiplier (turn off option: Assign → Global Project
Logic Synthesis → Automatic Fast I/O) produces, in the case of the 4 × 4-bit
multiplier, a substantial gain in performance. The maximum size of multiplier
that fits in the FLEX10K70 is a 34 × 34-bit unit that uses 2653 LCs and runs
at 25.77 MHz Registered Performance with six pipeline stages.

Fig. 2.14. Fast array multiplier for FPGAs.

Other multiplier architectures typically used in the ASIC world include 
Wallace-tree multipliers and Booth multipliers. They are discussed in Exer- 
cises 2.1 (p. 103) and 2.2 (p. 104) but are rarely used in connection with 
FPGAs. 

2.4.1 Multiplier Blocks 

A 2N × 2N multiplier can be defined in terms of N × N multiplier blocks.
The resulting multiplication is defined as:

$$P = Y \times X = (Y_2 2^N + Y_1)(X_2 2^N + X_1)$$
$$\phantom{P} = Y_2 X_2 2^{2N} + (Y_2 X_1 + Y_1 X_2) 2^N + Y_1 X_1, \qquad (2.31)$$

where the indices 2 and 1 indicate the most significant and least significant
N-bit halves, respectively. This partitioning scheme can be used if the
capacity of the FPGA is insufficient to implement a multiplier of desired size,
or used to implement a multiplier using 2-kbit EAB blocks. The number of
EAB blocks in the FLEX10K70 is limited to 9, and the maximum symmetric
multiplier MaxPlusII can compile is 8 × 8, although a 12 × 12 multiplier
should be theoretically possible. Table 2.8 shows the data for an EAB-based
multiplier. Comparing the data of Table 2.8 with the data from Figs. 2.15
(p. 63) and 2.16 (p. 64), it can be seen that the EAB-based multiplier reduces
the number of LCs but does not improve the Registered Performance.

Fig. 2.15. Performance of array multiplier for FPGAs, with (solid line) and
without (dotted line) using the I/O cell registers.
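The partitioning of (2.31) is easy to verify numerically. The following Python sketch uses a hypothetical function name and assumes unsigned operands; it builds a 2N × 2N product from four N × N products and two shifts.

```python
def mul_from_blocks(y, x, n):
    """2N x 2N unsigned product assembled from four N x N block products."""
    if not (0 <= x < (1 << 2 * n) and 0 <= y < (1 << 2 * n)):
        raise ValueError("operands must fit in 2N bits")
    mask = (1 << n) - 1
    y2, y1 = y >> n, y & mask      # most / least significant halves of Y
    x2, x1 = x >> n, x & mask      # most / least significant halves of X
    high = y2 * x2                 # weight 2^(2N)
    mid = y2 * x1 + y1 * x2        # weight 2^N
    low = y1 * x1                  # weight 2^0
    return (high << (2 * n)) + (mid << n) + low
```

With n = 4 this forms an 8 × 8 product from 4 × 4 blocks, which is the size of table that fits a single 2-kbit EAB in the 2^8 × 8 configuration.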



2.5 Binary Dividers 

Of all four basic arithmetic operations, division is the most complex.
Consequently, it is the most time-consuming operation and also the operation
with the largest number of different algorithms to be implemented. For a
given dividend (or numerator) N and divisor (or denominator) D the division
produces (unlike the other basic arithmetic operations) two results: the
quotient Q and the remainder R, i.e.,

$$\frac{N}{D} = Q \text{ and } R \quad \text{with } |R| < D. \qquad (2.32)$$

However, although we may think of division as the inverse process of
multiplication, as demonstrated through the following equation,

$$N = D \times Q + R, \qquad (2.33)$$

it differs from multiplication in many aspects. Most importantly, in
multiplication all partial products can be produced in parallel, while in
division each quotient bit is determined in a sequential “trial-and-error”
procedure.

Because most microprocessors handle division as the inverse process to
multiplication, referring to (2.33), the numerator is assumed to be the result
of a multiplication and has therefore twice the bit width of denominator and
quotient. As a consequence, the quotient has to be checked in an awkward
procedure to be in the valid range, i.e., that there is no overflow in the
quotient. We wish to use a more general approach in which we assume that

$Q \leq N$ and $|R| < D$,

i.e., quotient and numerator as well as denominator and remainder are
assumed to be of the same bit width. With these bit width assumptions no range
check (except for N/0, i.e., division by zero) for a valid quotient is
necessary.

Fig. 2.16. Effort in LCs for array multipliers, with (solid line) and without
(dotted line) using the I/O cell registers.

Table 2.8. Data for EAB-based multipliers.

Size    Use I/O     LCs    EABs    Pipeline    Registered
        register                   stages      Performance

4×4     yes         0      1       0           25.38 MHz
4×4     yes         0      1       1           41.15 MHz
4×4     yes         0      1       2           56.49 MHz
4×4     no          16     1       0           31.74 MHz
4×4     no          16     1       1           46.08 MHz
4×4     no          16     1       2           70.92 MHz
8×8     yes         29     4       0           18.38 MHz
8×8     yes         29     4       1           31.54 MHz
8×8     yes         29     4       2           31.54 MHz
8×8     no          49     4       0           19.92 MHz
8×8     no          49     4       1           35.97 MHz
8×8     no          49     4       2           36.10 MHz

Another consideration when implementing division comes when we deal
with signed numbers. Obviously, the easiest way to handle signed numbers is
first to convert both to unsigned numbers and compute the sign of the result
as an XOR or modulo 2 add operation of the sign bits of the two operands.
But some algorithms (like the nonrestoring division discussed below) can
directly process signed numbers. Then the question arises: how are the signs
of quotient and remainder related? In most hardware or software systems (but
not for all, such as in the PASCAL programming language), it is assumed
that the remainder and the quotient have the same sign. That is, although

$$\frac{234}{50} = 5 \text{ and } R = -16 \qquad (2.34)$$

meets the requirements from (2.33), we, in general, would prefer the following
results

$$\frac{234}{50} = 4 \text{ and } R = 34. \qquad (2.35)$$
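That both result pairs are formally valid can be confirmed with a one-line check of (2.33). The helper below is a hypothetical illustration in Python, not part of the original design flow; it tests the identity N = D × Q + R together with the bound |R| < |D|.

```python
def satisfies_division_identity(n, d, q, r):
    """True if n = d*q + r holds and the remainder is smaller than the divisor."""
    if d == 0:
        raise ZeroDivisionError("denominator must be nonzero")
    return n == d * q + r and abs(r) < abs(d)
```

Both (Q, R) = (5, -16) and (Q, R) = (4, 34) pass the check for 234/50, which is why an additional sign convention is needed to make the result unique.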

Let us now start with a brief overview of the most commonly used division
algorithms. Figure 2.17 shows the most popular linear and quadratic
convergence schemes. A basic categorization of the linear division algorithms
can be done according to the permissible values of each quotient digit
generated. In the binary restoring, nonperforming, or CORDIC algorithms the
digits are selected from the set

$\{0, 1\}$.

In the binary nonrestoring algorithms a signed-digit set is used, i.e.,

$\{-1, 1\} = \{\bar{1}, 1\}$.

In the binary SRT algorithm, named after Sweeney, Robertson, and Tocher
[25] who discovered the algorithms at about the same time, the digits from
the ternary set

$\{-1, 0, 1\} = \{\bar{1}, 0, 1\}$

are used. All of the above algorithms can be extended to higher radix
algorithms. The generalized SRT division algorithm of radix r, for instance,
uses the digit set

$\{-2^r - 1, \ldots, -1, 0, 1, \ldots, 2^r - 1\}$.

We find two algorithms with quadratic convergence to be popular. The
first algorithm is the division by reciprocation of the denominator, where
we compute the reciprocal with the Newton algorithm for finding zeros. The
second quadratic convergence algorithm was developed for the IBM 360/91
in the 1960s by Anderson et al. [47]. This algorithm multiplies numerator
and denominator with the same factors and converges $D \to 1$, which results
in $N \to Q$. Note that the division algorithms with quadratic convergence
produce no remainder.

Although the number of iterations in the quadratic convergence algorithms
is in the order of $\log_2(b)$ for b-bit operands, we must take into account
that each iteration step is more complicated (i.e., uses two multiplications)
than in the linear convergence algorithms, and speed and size performance
comparisons have to be done carefully.

Fig. 2.17. Survey of division algorithms.



2.5.1 Linear Convergence Division Algorithms 

The most obvious sequential algorithm is our “pencil-and-paper” method
(which we have used many times before) translated into binary arithmetic.
We first align the denominator and load the numerator in the remainder
register. We then subtract the aligned denominator from the remainder and
store the result in the remainder register. If the new remainder is positive
we set the quotient’s LSB to 1, otherwise the quotient’s LSB is set to zero
and we need to restore the previous remainder value by adding the denominator.
Finally, we have to realign the quotient and denominator for the next
step. The recalculation of the previous remainder is why we call such an
algorithm “restoring division.” The following example demonstrates an FSM
implementation of the algorithm.

Example 2.18: 8-bit Restoring Divider

The VHDL description⁴ of an 8-bit divider is developed below. Division is
performed in four stages. First, the 8-bit numerator is “loaded” in the
remainder register, the 6-bit denominator is loaded and aligned (by $2^{N-1}$
for an N-bit numerator), and the quotient register reset. In the second and
third stages, s1 and s2, the actual serial division takes place. In the fourth
step, s3, quotient and remainder are transferred to the output registers.
Numerator and quotient are assumed to be 8 bits wide, while denominator and
remainder are 6-bit values.

-- Restoring Division
LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;

ENTITY div_res IS                         ------> Interface
  GENERIC (WN     : INTEGER := 8;
           WD     : INTEGER := 6;
           PO2WND : INTEGER := 8192;  -- 2**(WN+WD-1)
           PO2WN1 : INTEGER := 128;   -- 2**(WN-1)
           PO2WN  : INTEGER := 255);  -- 2**WN-1
  PORT ( clk   : IN  STD_LOGIC;
         n_in  : IN  STD_LOGIC_VECTOR(WN-1 DOWNTO 0);
         d_in  : IN  STD_LOGIC_VECTOR(WD-1 DOWNTO 0);
         r_out : OUT STD_LOGIC_VECTOR(WD-1 DOWNTO 0);
         q_out : OUT STD_LOGIC_VECTOR(WN-1 DOWNTO 0));
END div_res;

ARCHITECTURE flex OF div_res IS
  SUBTYPE TWOWORDS IS INTEGER RANGE -1 TO PO2WND-1;
  SUBTYPE WORD IS INTEGER RANGE 0 TO PO2WN;
  TYPE STATE_TYPE IS (s0, s1, s2, s3);
  SIGNAL state : STATE_TYPE;
BEGIN
-- Bit width:  WN          WD            WN            WD
--         Numerator / Denominator = Quotient and Remainder
-- OR:       Numerator = Quotient * Denominator + Remainder

  States: PROCESS           ------> Divider in behavioral style
    VARIABLE r, d : TWOWORDS;             -- N+D bit width
    VARIABLE q : WORD;
    VARIABLE count : INTEGER RANGE 0 TO WN;
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN s0 =>            -- Initialization step
        state <= s1;
        count := 0;
        q := 0;             -- Reset quotient register
        d := PO2WN1 * CONV_INTEGER(d_in);  -- Load denominator
        r := CONV_INTEGER(n_in);           -- Remainder = numerator
      WHEN s1 =>            -- Processing step
        r := r - d;         -- Subtract denominator
        state <= s2;
      WHEN s2 =>            -- Restoring step
        IF r < 0 THEN
          r := r + d;       -- Restore previous remainder
          q := q * 2;       -- LSB = 0 and SLL
        ELSE
          q := 2 * q + 1;   -- LSB = 1 and SLL
        END IF;
        count := count + 1;
        d := d / 2;
        IF count = WN THEN  -- Division ready ?
          state <= s3;
        ELSE
          state <= s1;
        END IF;
      WHEN s3 =>            -- Output of result
        q_out <= CONV_STD_LOGIC_VECTOR(q, WN);
        r_out <= CONV_STD_LOGIC_VECTOR(r, WD);
        state <= s0;        -- Start next division
    END CASE;
  END PROCESS States;

END flex;

⁴ The equivalent Verilog code div_res.v for this example can be found in
Appendix A on page 446.

Figure 2.18 shows the simulation result of a division of 234 by 50. The
register d shows the aligned denominator values $50 \times 2^7 = 6400$,
$50 \times 2^6 = 3200$, .... Every time the remainder r (shown in the
simulation as hex value due to its large bit width) calculated in step s1 is
negative, the previous remainder is restored in step s2. In state s3 the
quotient 4 and the remainder 34 are transferred to the output registers of the
divider. The design uses 179 LCs and runs with a Registered Performance of
37.59 MHz. | 2.18 |
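The sequence of states s1 and s2 can be reproduced with a short Python model. This is a behavioral sketch with a hypothetical function name; the step counter is an added illustration that counts one step per visit of s1 and one per visit of s2.

```python
def div_restoring(n, d, wn=8):
    """Restoring division of an unsigned wn-bit numerator n by d > 0.

    Returns (quotient, remainder, steps), where steps counts the
    subtract steps (s1) and the test/restore steps (s2).
    """
    if d <= 0 or not 0 <= n < (1 << wn):
        raise ValueError("need d > 0 and 0 <= n < 2**wn")
    r, q, steps = n, 0, 0
    aligned = d << (wn - 1)        # denominator aligned by 2**(wn-1)
    for _ in range(wn):
        r -= aligned               # s1: trial subtraction
        steps += 1
        if r < 0:                  # s2: restore and shift in a 0
            r += aligned
            q = 2 * q
        else:                      # s2: keep result and shift in a 1
            q = 2 * q + 1
        steps += 1
        aligned >>= 1              # realign denominator for the next bit
    return q, r, steps
```

For 234/50 the model returns the quotient 4 and the remainder 34 after 16 steps, i.e., two steps per quotient bit.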



The main disadvantage of the restoring division is that we need two steps
to determine one quotient bit. We can combine the two steps using a
nonperforming divider algorithm, i.e., each time the denominator is larger than
the remainder, we do not perform the subtraction. In VHDL we would write
the new step as:

  t := r - d;           -- temporary remainder value
  IF t >= 0 THEN        -- Nonperforming test
    r := t;             -- Use new remainder
    q := q * 2 + 1;     -- LSB = 1 and SLL
  ELSE
    q := q * 2;         -- LSB = 0 and SLL
  END IF;

The number of steps is reduced by a factor of 2 (not counting initialization
and transfers of results), as can be seen from the simulation in Fig. 2.19.
Note also from the simulation shown in Fig. 2.19 that the remainder r is
never negative in the nonperforming division algorithms. On the downside, the
worst case delay path is increased when compared with the restoring division
and the maximum Registered Performance is expected to be reduced by
approximately 20%, see Problem 2.17 (p. 106). The nonperforming divider
has two arithmetic operations and the if condition in the worst case path,
while the restoring divider has (see step s2) only the if condition and one
arithmetic operation in the worst case path.
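A Python sketch of the nonperforming step makes the two properties mentioned above visible: one step per quotient bit, and a remainder that never becomes negative. The function name and the returned trace of remainder values are illustrative assumptions.

```python
def div_nonperforming(n, d, wn=8):
    """Nonperforming division of an unsigned wn-bit numerator n by d > 0.

    Returns (quotient, remainder, trace), where trace lists the remainder
    register after each of the wn steps.
    """
    if d <= 0 or not 0 <= n < (1 << wn):
        raise ValueError("need d > 0 and 0 <= n < 2**wn")
    r, q, trace = n, 0, []
    aligned = d << (wn - 1)
    for _ in range(wn):
        t = r - aligned            # temporary remainder value
        if t >= 0:                 # nonperforming test
            r = t                  # accept the subtraction
            q = 2 * q + 1
        else:
            q = 2 * q              # subtraction is not performed
        trace.append(r)
        aligned >>= 1
    return q, r, trace
```

The trace has exactly wn entries, compared with the 2 × wn steps of the restoring scheme.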

A similar approach to the nonperforming algorithm, but one that does not
increase the critical path, is the so-called nonrestoring division. The idea
behind the nonrestoring division is that if we have computed in the restoring
division a negative remainder, i.e., $r_{k+1} = r_k - d_k$, then in the next
step we will restore $r_k$ by adding $d_k$ and then perform a subtraction of
the next aligned denominator $d_{k+1} = d_k/2$. So, instead of adding $d_k$
followed by subtracting $d_k/2$, we can just skip the restoring step and
proceed with adding $d_k/2$, when the remainder has (temporarily) a negative
value. As a result, we have now quotient bits that can be positive or negative,
i.e., $q_k = \pm 1$, but not zero. We can change this signed-digit
representation later to a two’s complement representation. In conclusion, the
nonrestoring algorithm works as follows: every time the remainder after the
iteration is positive we store a 1 and subtract the aligned denominator, while
for a negative remainder, we store a $-1 = \bar{1}$ in the quotient register
and add the aligned denominator. To use only one bit in the quotient register
we will use a zero in the quotient register to code the $-1$. To convert this
signed-digit quotient back to a two’s complement word, the straightforward way
is to put all 1s in one word and the zeros, which are actually the coded
$-1 = \bar{1}$, in the second word as ones. Then we need just to subtract the
two words to compute the two’s complement. On the other hand, this subtraction
of the $-1$s is nothing other than the complement of the quotient augmented
by 1. In conclusion, if q holds the signed-digit representation, we can
compute the two’s complement via

$$q_{2C} = 2 \times q_{SD} + 1. \qquad (2.36)$$

Both quotient and remainder are now in the two’s complement representation
and have a valid result according to (2.33). If we wish to constrain our results
in a way that both have the same sign, we need to correct the negative
remainder, i.e., for r < 0 we correct this via

r := r + D and q := q − 1.

Such a nonrestoring divider will now run faster than the nonperforming
divider, with about the same Registered Performance as the restoring
divider, see Problem 2.18 (p. 106). Figure 2.20 shows a simulation of the
nonrestoring divider. We notice from the simulation that register values of the
remainder are allowed now again to be negative. Note also that the
above-mentioned correction for negative remainder is necessary for this value.
The not corrected result is q = 5 and r = −16 (displayed in MaxPlusII as
r = 64 − 16 = 48). The equal sign correction results in q = 5 − 1 = 4
and r = −16 + 50 = 34, as shown in Fig. 2.20.
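The nonrestoring scheme, the conversion (2.36), and the final sign correction can be followed in a compact Python model. It is a sketch under stated assumptions: the function name is hypothetical, the operands are an unsigned wn-bit numerator and a positive denominator, and (2.36) is evaluated modulo 2^wn so that the result fits the wn-bit quotient register.

```python
def div_nonrestoring(n, d, wn=8):
    """Nonrestoring division; returns ((q_raw, r_raw), (q, r)).

    The first pair is the uncorrected result, the second pair is the
    result after the equal-sign correction r := r + d, q := q - 1.
    """
    if d <= 0 or not 0 <= n < (1 << wn):
        raise ValueError("need d > 0 and 0 <= n < 2**wn")
    r, q_sd = n, 0
    aligned = d << (wn - 1)
    for _ in range(wn):
        if r >= 0:
            r -= aligned           # digit +1, stored as bit 1
            q_sd = 2 * q_sd + 1
        else:
            r += aligned           # digit -1, stored as bit 0
            q_sd = 2 * q_sd
        aligned >>= 1
    q = (2 * q_sd + 1) & ((1 << wn) - 1)   # (2.36), kept to wn bits
    raw = (q, r)
    if r < 0:                      # equal-sign correction
        r += d
        q -= 1
    return raw, (q, r)
```

For 234/50 the raw result is q = 5 with r = -16, and the correction yields q = 4 with r = 34, matching the values quoted for the simulation.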

To shorten further the number of clock cycles needed for the division,
higher radix (array) dividers can be built using, for instance, the SRT and
radix 4 coding. This is popular in ASIC designs when combined with the
carry-save-adder principle as used in the floating-point accelerators of the
Pentium microprocessors. For FPGAs with a limited LUT size these higher-order
schemes seem to be less attractive.

A totally different approach to improve the latency is taken by the division
algorithms with quadratic convergence, which use fast array multipliers. The
two most popular versions of these quadratic convergence schemes are discussed
in the next section.

Fig. 2.20. Simulation results for a nonrestoring divider.



2.5.2 Fast Divider Design 



The first fast divider algorithm we wish to discuss is the division through
multiplication with the reciprocal of the denominator D. The reciprocal can,
for instance, be computed via a look-up table for small bit width. The general
technique for constructing iterative algorithms, however, makes use of the
Newton method for finding a zero. According to this method, we define a
function

$$f(x) = \frac{1}{x} - D \to 0. \qquad (2.37)$$

If we define an algorithm such that $f(x_\infty) = 0$ then it follows that

$$\frac{1}{x_\infty} - D = 0 \quad \text{or} \quad x_\infty = \frac{1}{D}. \qquad (2.38)$$

Using the tangent, the estimation for the next $x_{k+1}$ is calculated using

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \qquad (2.39)$$

with $f(x) = 1/x - D$ we have $f'(x) = -1/x^2$ and the iteration equation
becomes

$$x_{k+1} = x_k - \frac{1/x_k - D}{-1/x_k^2} = x_k (2 - D \times x_k). \qquad (2.40)$$

Fig. 2.21. Newton’s zero-finding algorithm for $x_\infty = 1/0.8 = 1.25$.

Although the algorithm will converge for any initial D, it converges much
faster if we start with a normalized value close to 1.0, i.e., we normalize D
in such a way that $0.5 \leq D < 1$ or $1 \leq D < 2$ as used for floating-point
mantissa, see Sec. 2.6 (p. 76). We can then use an initial value $x_0 = 1$ to
get fast convergence. Let us illustrate the Newton algorithm with a short
example.

Example 2.19: Newton Algorithm 

Let us try to compute the Newton algorithm for 1/D = 1/0.8 = 1.25. The
following table shows in the first column the number of the iteration, in the
second column the approximation to 1/D, in the third column the error
$x_k - x_\infty$, and in the last column the equivalent bit precision of our
approximation.

k    x_k      x_k − x_∞         eff. bits

0    1.0      −0.25             2
1    1.2      −0.05             4.3
2    1.248    −0.002            8.9
3    1.25     −3.2 × 10^−6      18.2
4    1.25     −8.2 × 10^−12     36.8

Figure 2.21 shows a graphical interpretation of the Newton zero-finding
algorithm. The $f(x_k)$ converges rapidly to zero. | 2.19 |
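The table of the example can be regenerated with a few lines of Python. This is a floating-point sketch with hypothetical function names; it uses the iteration $x_{k+1} = x_k (2 - D x_k)$ and measures the effective bits as $-\log_2 |x_k - 1/D|$.

```python
import math


def newton_reciprocal(d, iterations, x0=1.0):
    """Return [x_0, x_1, ..., x_iterations] for the reciprocal iteration."""
    x = x0
    history = [x]
    for _ in range(iterations):
        x = x * (2.0 - d * x)      # two multiplications per iteration
        history.append(x)
    return history


def effective_bits(x, d):
    """Equivalent bit precision of the approximation x to 1/d."""
    error = abs(x - 1.0 / d)
    if error == 0.0:
        raise ValueError("approximation is exact in floating point")
    return -math.log2(error)
```

Calling newton_reciprocal(0.8, 4) gives the sequence 1.0, 1.2, 1.248, ... of the example, and the effective bits roughly double from one entry to the next.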



Because the first iterations in the Newton algorithm only produce a few bits 
of precision, it may be useful to use a small look-up table to skip the first 
iterations. A table to skip the first two iterations can, for instance, be found 
in [25, p. 260]. 

We note also from the above example the overall rapid convergence of the
algorithm. Only 5 steps are necessary to have over 32-bit precision. Many
more steps would be required to reach the same precision with the linear
convergence algorithms. This quadratic convergence applies to all values, not
only to our special example. This can be shown as follows:

$$e_{k+1} = x_{k+1} - x_\infty = x_k (2 - D x_k) - \frac{1}{D} = -D \left( x_k - \frac{1}{D} \right)^2 = -D e_k^2,$$

i.e., the error improves in a quadratic fashion from one iteration to the next.
With each iteration we double the effective number of bits of precision.
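
The table of Example 2.19 and the quadratic error law can be checked with a short floating-point model. The following Python sketch is only an illustration; the function name `newton_reciprocal` and the variable names are chosen here and do not come from the text.

```python
import math

def newton_reciprocal(d, x0=1.0, steps=4):
    """Return the iterates x_0 .. x_steps of x_{k+1} = x_k * (2 - d * x_k)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] * (2.0 - d * xs[-1]))
    return xs

D = 0.8
iterates = newton_reciprocal(D)
errors = [x - 1.0 / D for x in iterates]   # e_k = x_k - x_inf

for k, (x, e) in enumerate(zip(iterates, errors)):
    # Effective bits: -log2 of the absolute error (guarded against e == 0).
    bits = -math.log2(abs(e)) if e else float("inf")
    print(f"k={k}  x_k={x:.10f}  error={e:+.2e}  eff. bits={bits:.1f}")
```

Each printed error should be close to $-D$ times the square of the previous one, which is the doubling of effective bits noted above.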

Although the Newton algorithm has been successfully used in micropro- 
cessor design (e.g., IBM RISC 6000), it has two main disadvantages: First, the 
two multiplications in each iteration are sequential, and second, the quanti- 
zation error of the multiplication is accumulated due to the sequential nature 
of the multiplication. Additional guard bits are used in general to avoid this 
quantization error. 

The following convergence algorithm, although similar to the Newton
algorithm, has an improved quantization behavior and uses two multiplications
in each iteration that can be computed in parallel.

In the convergence division scheme both numerator N and denominator
D are multiplied by approximation factors $f_k$, such that, for a sufficient
number of iterations k, we find

$$D \prod_k f_k \to 1 \quad \text{and} \quad N \prod_k f_k \to Q. \qquad (2.41)$$

This algorithm, originally developed for the IBM 360/91, is credited to An- 
derson et al. [47], and the algorithm works as follows: 

Algorithm 2.20: Division by Convergence 

1) Normalize N and D such that D is close to 1. Use a normalization
interval such as 0.5 ≤ D < 1 or 1 ≤ D < 2 as used for the floating-point
mantissa.

2) Initialize $x_0 = N$ and $t_0 = D$.

3) Repeat the following loop until $x_k$ shows the desired precision.

    f_k     = 2 - t_k
    x_{k+1} = x_k × f_k
    t_{k+1} = t_k × f_k
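
The three steps of Algorithm 2.20 can be sketched in floating-point Python as follows. This is a minimal model, assuming the operands are already normalized; the name `divide_by_convergence` is hypothetical and chosen only for this illustration.

```python
def divide_by_convergence(n, d, steps=4):
    """Approximate n / d for a normalized d in [0.5, 2).

    Returns (x, t): x approaches the quotient, t approaches 1.
    """
    x, t = n, d                # step 2: x_0 = N, t_0 = D
    for _ in range(steps):     # step 3: fixed number of iterations
        f = 2.0 - t            # scaling factor f_k
        x = x * f              # numerator   -> Q
        t = t * f              # denominator -> 1
    return x, t

q, t = divide_by_convergence(1.5, 1.2)
print(q, t)
```

The two multiplications inside the loop do not depend on each other, which is why they can be computed in parallel in hardware.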

It is important to note that the algorithm is self-correcting. Any quan- 
tization error in the factors does not really matter because numerator and 
denominator are multiplied with the same factor fk . This fact has been used 
in the IBM 360/91 design to reduce the required resources. The multiplier 
used for the first iteration has only a few significant bits, while in later
iterations more multiplier bits are allocated as the factor $f_k$ gets closer to 1.

Let us demonstrate the division by convergence algorithm with the
following example.

Example 2.21: Anderson-Earle-Goldschmidt-Powers Algorithm

Let us try to compute the division by convergence algorithm for N = 1.5
and D = 1.2, i.e., Q = N/D = 1.25. The following table shows in the first
column the number of the iteration, in the second column the scaling factor
$f_k$, in the third column the approximation to N/D, in the fourth column
the error $x_k - x_\infty$, and in the last column the equivalent bit precision of our
approximation.



    k    f_k                      x_k                x_k - x_∞         eff. bits
    0    0.8 ≈ 205/256            1.5 ≈ 384/256       0.25              2
    1    1.04 ≈ 267/256           1.2 ≈ 307/256      -0.05              4.3
    2    1.0016 ≈ 257/256         1.248 ≈ 320/256    -0.002             8.9
    3    1.0 + 2.56 × 10^-6       1.25               -3.2 × 10^-6      18.2
    4    1.0 + 6.55 × 10^-12      1.25               -8.2 × 10^-12     36.8



We note the same quadratic convergence as in the Newton algorithm, see 
Example 2.19 (p. 72). 

The VHDL description of an 8-bit fast divider is developed below. We assume
that denominator and numerator are normalized as, for instance, typical for
floating-point mantissa values, to the interval 1 ≤ N, D < 2. This
normalization step may require substantial additional resources (leading zeros
detection and two barrelshifters) when denominator and numerator are not
normalized. Numerator, denominator, and quotient are all assumed to be 9
bits wide. The decimal values 1.5, 1.2, and 1.25 are represented in a 1.8-bit
format (1 integer and 8 fractional bits) as 1.5 × 256 = 384, 1.2 × 256 = 307,
and 1.25 × 256 = 320, respectively. Division is performed in three stages.
First, the 1.8-formatted denominator and numerator are "loaded" into the
registers. In the second state, s1, the actual convergence division takes place.
In the third state, s2, the quotient is transferred to the output register.

    -- Convergence division after Anderson, Earle, Goldschmidt,
    LIBRARY ieee;                               -- and Powers
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;
    USE ieee.std_logic_unsigned.ALL;

    ENTITY div_aegp IS                          ------> Interface
      GENERIC (WN     : INTEGER := 9;  -- 8 bit plus one integer bit
               WD     : INTEGER := 9;
               STEPS  : INTEGER := 2;
               TWO    : INTEGER := 512;    -- 2**WN, i.e., 2.0 in 1.8 format
               PO2WN  : INTEGER := 256;    -- 2**(WN-1)
               PO2WN2 : INTEGER := 1023);  -- 2**(WN+1)-1
      PORT ( clk   : IN  STD_LOGIC;
             n_in  : IN  STD_LOGIC_VECTOR(WN-1 DOWNTO 0);
             d_in  : IN  STD_LOGIC_VECTOR(WD-1 DOWNTO 0);
             q_out : OUT STD_LOGIC_VECTOR(WD-1 DOWNTO 0));
    END div_aegp;

    ARCHITECTURE flex OF div_aegp IS

      SUBTYPE WORD IS INTEGER RANGE 0 TO PO2WN2;
      TYPE STATE_TYPE IS (s0, s1, s2);
      SIGNAL state : STATE_TYPE;

    BEGIN
    -- Bit width:  WN         WD           WN             WD
    --         Nominator / Denumerator = Quotient and Remainder
    -- OR:       Nominator = Quotient * Denumerator + Remainder

      States: PROCESS          ------> Divider in behavioral style
        VARIABLE x, t, f : WORD;   -- WN+1 bits
        VARIABLE count   : INTEGER RANGE 0 TO STEPS;
      BEGIN
        WAIT UNTIL clk = '1';
        CASE state IS
          WHEN s0 =>               -- Initialization step
            state <= s1;
            count := 0;
            t := CONV_INTEGER(d_in);   -- Load denominator
            x := CONV_INTEGER(n_in);   -- Load nominator
          WHEN s1 =>               -- Processing step
            f := TWO - t;
            x := x * f / PO2WN;
            t := t * f / PO2WN;
            count := count + 1;
            IF count = STEPS THEN  -- Division ready ?
              state <= s2;
            ELSE
              state <= s1;
            END IF;
          WHEN s2 =>               -- Output of results
            q_out <= CONV_STD_LOGIC_VECTOR(x, WN);
            state <= s0;           -- start next division
        END CASE;
      END PROCESS States;

    END flex;

The equivalent Verilog code div_aegp.v for this example can be found in
Appendix A on page 448.

Figure 2.22 shows the simulation result of a division 1.5/1.2. The variable f
(which becomes an internal net and is not shown in the simulation) holds the
three scaling factors 205, 267, and 257, sufficient for 8-bit precision results.
The x and t values are multiplied by the scaling factor f and scaled down to the
1.8 format. x converges to the quotient 1.25 = 320/256, while t converges to
1.0 ≈ 255/256, as expected. In state s2 the quotient 1.25 = 320/256 is transferred
to the output registers of the divider. Note that the divider produces no
remainder. The design uses 469 LCs and runs with a Registered Performance
of 14.59 MHz. | 2.21 |
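
The integer arithmetic of the processing state can be traced with a small software model. The Python sketch below mirrors only the three assignments to f, x, and t in the 1.8 format; it is not a substitute for the VHDL simulation, and the function name `div_aegp_model` is an assumption made for this illustration.

```python
def div_aegp_model(n_in, d_in, steps=2, wn=9):
    """Integer model of the convergence divider in 1.(wn-1) fixed-point format."""
    two = 1 << wn           # 2.0 in 1.8 format (512)
    one = 1 << (wn - 1)     # 1.0 in 1.8 format (256), used to rescale products
    x, t = n_in, d_in
    trace = []
    for _ in range(steps):
        f = two - t         # scaling factor
        x = x * f // one    # integer division truncates, as for positive VHDL integers
        t = t * f // one
        trace.append((f, x, t))
    return x, trace

q, trace = div_aegp_model(384, 307)   # 1.5 / 1.2 in 1.8 format
for f, x, t in trace:
    print(f, x, t)
```

With the default of two processing steps the model ends at x = 320 and t = 255, the values quoted for the quotient and the denominator.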



Fig. 2.22. Simulation results for a convergence divider.

Although the Registered Performance of the nonrestoring divider (see
Fig. 2.20) is about twice as high, the total latency of the convergence
divider is lower, because the number of processing steps is reduced from 8
to $\lceil \sqrt{8} \rceil = 3$ (not counting initialization in both algorithms). The convergence
divider uses more than twice as many LCs as the nonrestoring divider.

2.5.3 Array Divider 

Obviously, as with multipliers, all division algorithms can be implemented in
a sequential, FSM-like, way or in the array form. If the array form and
pipelining are desired, a good option is to use the lpm_divide block, which
implements an array divider with the option of pipelining; see Appendix B
(p. 506) for a detailed description of the lpm_divide block.

Figure 2.23 shows the Registered Performance and Fig. 2.24 the LCs
necessary for 4 × 4-, 8 × 8-, and 16 × 16-bit array dividers, with (solid
line) and without (dotted line) using the I/O cell registers. We note from the
performance measurement that the optimal number of pipeline stages is the
same as the number of bits in the denominator.



2.6 Floating-point Arithmetic Implementation 

Due to the large gate count capacity of current FPGAs the design of floating- 
point arithmetic has become a viable option. In addition, the introduction 
of the embedded 18 x 18 bit array multiplier in Altera Stratix and Xilinx 
Virtex II and Spartan III FPGA device families allows an efficient design 
of custom floating-point arithmetic. We will therefore discuss the design of 
basic building blocks such as a floating-point adder, subtractor, multiplier, 
reciprocal and divider, and the necessary conversion blocks to and from fixed- 
point data format. Such blocks are available from several IP providers, or
through special request via e-mail to Uwe.Meyer-Baese@ieee.org.

Fig. 2.23. Performance of array divider using the lpm_divide macro block.



Most of the commercially available floating-point blocks use (typically 3) 
pipeline stages to increase the throughput. To keep the presentation simple we 
will not use pipelining. The custom floating-point format we will use is the 
(1,6,5) floating-point format introduced in Sec. 2.2.3, (p. 47). This format 
uses 1 sign bit, 6 bits for the exponent and 5 bits for the mantissa. We 
support special coding for zero and infinities, but we do not support NaNs 
or denormals. Rounding is done via truncation. The fixed-point format used 
in the examples has 6 integer bits (including a sign bit) and 6 fractional bits. 

2.6.1 Fixed-point to Floating-point Format Conversion 

As shown in Sec. 2.2.3, (p. 47), floating-point numbers use a signed-magnitude 
format and the first step is therefore to convert the two’s complement number 
to signed-magnitude form. If the sign of the fixed-point number is one, we 
need to compute the complement of the fixed-point number, which becomes 
the unnormalized mantissa. In the next step we normalize the mantissa and 
compute the exponent. For the normalization we first determine the number 
of leading zeros. This can be done with a LOOP statement within a sequential 
PROCESS in VHDL. Using this number of leading zeros, we shift the mantissa
left until the first 1 "leaves" the mantissa registers, i.e., the hidden one is also
removed. This shift operation is actually the task of a barrelshifter, which
can be inferred in VHDL via the SLL instruction. Unfortunately, SLL is
not supported in Altera's MaxPlusII because it is part of the VHDL 1993
but not the 1987 standard, but we can design a barrelshifter in many different
ways, as Exercise 2.19 (p. 107) shows.

Fig. 2.24. Effort in LCs for array divider using the lpm_divide macro block.

The exponent of our floating-point number is computed as the sum of
the bias and the number of integer bits in our fixed-point format minus the
leading zeros in the unnormalized mantissa.

Finally, we concatenate the sign, exponent, and the normalized mantissa
to a single floating-point word if the fixed-point number is not zero; otherwise
we set the floating-point word also to zero.

We have assumed that the range of the floating-point number is larger
than the range of the fixed-point number, i.e., the special number ∞ will
never be used in the conversion.
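
The conversion steps can be sketched as a software model. The Python code below is a minimal sketch under stated assumptions: the bias is taken as 31 (consistent with the worked examples later in this section), zero is coded as an all-zero word, and the fixed-point input is passed as a two's complement integer with 6 fractional bits. The function name `fix2fp` is borrowed from the block name in Table 2.9, but the code is not the book's VHDL.

```python
BIAS = 31        # assumed bias for the 6-bit exponent
FRAC_BITS = 6    # fractional bits of the fixed-point format
MANT_BITS = 5    # mantissa bits of the floating-point format

def fix2fp(fix):
    """Convert a 12-bit two's complement integer (value = fix / 64)
    to a (sign, exponent, mantissa) triple in the (1,6,5) format."""
    if fix == 0:
        return (0, 0, 0)                     # zero: all-zero word (assumption)
    sign = 1 if fix < 0 else 0
    mag = -fix if fix < 0 else fix           # signed-magnitude form
    msb = mag.bit_length() - 1               # position of the leading one
    exponent = BIAS + msb - FRAC_BITS        # bias + integer bits - leading zeros
    frac = mag - (1 << msb)                  # remove the hidden one
    if msb >= MANT_BITS:
        mantissa = frac >> (msb - MANT_BITS) # rounding by truncation
    else:
        mantissa = frac << (MANT_BITS - msb)
    return (sign, exponent, mantissa)

print(fix2fp(64), fix2fp(-64), fix2fp(592))  # 1.0, -1.0, 9.25
```

Finding the leading one with `bit_length` plays the role of the leading-zero count, and the two shifts stand in for the barrelshifter.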

Figure 2.25 shows the conversion from 12-bit fixed-point data to the
(1,6,5) floating-point data for five values ±1, absolute maximum, absolute
minimum, and the smallest value. Row 1 shows the decimal values, rows 2
to 4 show the 12-bit fixed-point number and the integer and fractional parts.
Rows 5 to 8 show the complete floating-point number, followed by the three
parts, sign, exponent, and mantissa.

Fig. 2.25. Simulation results for a (1,5,6) fixed-point format to (1,6,5)
floating-point conversion.



2.6.2 Floating-point to Fixed-point Format Conversion 

The floating-point to fixed-point conversion is, in general, more complicated
than the conversion in the other direction. Depending on whether the exponent
is larger or smaller than the bias, we need to implement a left or right shift
of the mantissa. In addition, extra consideration is necessary for the special
values ±∞ and ±0.

To keep the discussion as simple as possible, we assume in the following 
that the floating-point number has a larger dynamic range than the fixed- 
point number, but the fixed-point number has a higher precision, i.e., the 
number of fractional bits of the fixed-point number is larger than the bits 
used for the mantissa in the floating-point number. 

The first step in the conversion is the correction of the bias in the
exponent. We then place the hidden 1 to the left and the (fractional) mantissa
to the right of the binary point of the fixed-point word. We then check if the
exponent is too large to be represented with the fixed-point number and, if so,
set the fixed-point number to the maximum value. Likewise, if the exponent is
too small, we set the output value to zero. If the exponent is in the valid
range, so that the floating-point number can be represented with the fixed-point
format, we shift the 1.m mantissa value (for the format see (2.23), p. 47) left
for positive exponents and right for negative exponent values. This, in general,
can be coded with SLL and SRL in VHDL, respectively, but these 1993 standard
features are not supported in Altera's MaxPlusII. In the final step we
convert the signed-magnitude representation to the two's complement format
by evaluating the sign bit of the floating-point number.
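
The reverse conversion can be modeled in the same style. The sketch below again assumes a bias of 31, an exponent code of 0 for zero and 63 for infinity, and a symmetric saturation at ±2047; the text only speaks of "the maximum value", so the saturation value for negative numbers is an assumption. The name `fp2fix` follows the block name in Table 2.9 and is not the book's code.

```python
BIAS = 31        # assumed bias of the 6-bit exponent
E_INF = 63       # assumed exponent code for infinity
FIX_MAX = 2047   # largest magnitude used for saturation (assumption)

def fp2fix(sign, exponent, mantissa):
    """Convert a (1,6,5) floating-point triple to a 12-bit two's complement
    integer with 6 fractional bits (value = result / 64)."""
    if exponent == 0:
        return 0                         # zero
    mag = 32 + mantissa                  # 1.m scaled by 2**5 (hidden one restored)
    shift = exponent - BIAS + 1          # bias correction plus 6 - 5 scaling bits
    if exponent == E_INF:
        mag = FIX_MAX                    # infinity saturates
    elif shift >= 0:
        mag = min(mag << shift, FIX_MAX) # left shift, saturate if too large
    else:
        mag >>= -shift                   # right shift; tiny values become zero
    return -mag if sign else mag         # back to two's complement

print(fp2fix(0, 31, 0), fp2fix(1, 31, 0), fp2fix(0, 35, 31))
```

Converting 2047 to floating point and back with these two models gives 2016 (31.5 instead of about 31.98), which illustrates the precision loss of the 5-bit mantissa.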

Figure 2.26 shows the conversion from (1,6,5) floating-point format to
(1,5,6) fixed-point data for the five values ±1, absolute maximum, absolute
minimum, and the smallest value. Row 1 shows the decimal values, rows 2 to
5 the 12-bit floating-point number and the three parts, sign, exponent, and
mantissa. Rows 6 to 8 show the complete fixed-point number, followed
by the integer and fractional parts. Note that the conversion is without any
quantization error for ±1 and the smallest value. For the absolute maximum
and minimum values, however, the smaller precision in the floating-point
numbers gives the imperfect conversion values compared with Fig. 2.25.

Fig. 2.26. Simulation results for (1,6,5) floating-point format to (1,5,6)
fixed-point format conversion.

2.6.3 Floating-point Multiplication 

In contrast to fixed-point operations, multiplication in floating-point is the 
simplest of all arithmetic operations and we will discuss this first. In general, 
the multiplication of two numbers in scientific format is accomplished by 
multiplication of the mantissas and adding of the exponents, i.e., 

$$f_1 \times f_2 = (a_1 2^{e_1}) \times (a_2 2^{e_2}) = (a_1 \times a_2)\, 2^{e_1 + e_2}.$$

For our floating-point format with an implicit one and a biased exponent this 
becomes 

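$$\begin{aligned}
f_1 \times f_2 &= (-1)^{s_1} \left( 1.m_1\, 2^{e_1 - \text{bias}} \right) \times (-1)^{s_2} \left( 1.m_2\, 2^{e_2 - \text{bias}} \right) \\
&= (-1)^{(s_1 + s_2) \bmod 2}\, \underbrace{(1.m_1 \times 1.m_2)}_{m_3}\, 2^{\overbrace{e_1 + e_2 - \text{bias}}^{e_3} - \text{bias}} \\
&= (-1)^{s_3}\, 1.m_3\, 2^{e_3 - \text{bias}}.
\end{aligned}$$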

We note that the exponent sum needs to be adjusted by the bias, since the
bias is contained in both exponents and is therefore included twice in their
sum. The sign of the product is the XOR or modulo-2 sum of the two sign bits
of the two operands. We need also to take care of the special values. If one
factor is ∞ the product should be ∞ too. Next, we check if one factor is zero
and set the product to zero if true. Because we do not support NaNs, this
implies that 0 × ∞ is set to ∞. Special values may also be produced from
original nonspecial operands. If we detect an overflow, i.e.,

$$e_1 + e_2 - \text{bias} > E_{\max},$$

we set the product to ∞. Likewise, if we detect an underflow, i.e.,

$$e_1 + e_2 - \text{bias} < E_{\min},$$

we set the product to zero. It can be seen that the internal representation of
the exponent $e_3$ of the product must have two more bits than the two factors,
because we need a sign and a guard bit. Fortunately, the normalization of
the product $1.m_3$ is relatively simple: because both operands are in the range
$1.0 \le 1.m_{1,2} < 2.0$, the mantissa product is in the range
$1.0 \le 1.m_3 < 4.0$, i.e., a shift by one bit (and exponent adjustment by 1)
is sufficient to normalize the product.

Fig. 2.27. Simulation results for multiplications with floating-point numbers in
the (1,6,5) format.

Finally, we build the new floating-point number by concatenation of the 
sign, exponent, and magnitude. 
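
The multiplication rules above can be collected in a small software model operating on (sign, exponent, mantissa) triples. The Python sketch below assumes bias = 31, E_min = 1, E_max = 62, and the exponent codes 0 for zero and 63 for infinity; these values are consistent with the worked examples but are assumptions of this illustration. The name `fp_mul` follows the block name in Table 2.9.

```python
BIAS, E_MIN, E_MAX, E_INF = 31, 1, 62, 63   # assumed format constants

def fp_mul(a, b):
    """Multiply two (sign, exponent, mantissa) triples in the (1,6,5) format."""
    (s1, e1, m1), (s2, e2, m2) = a, b
    s3 = s1 ^ s2                         # sign is the XOR of the operand signs
    if e1 == E_INF or e2 == E_INF:
        return (s3, E_INF, 0)            # infinity first, so 0 x inf = inf
    if e1 == 0 or e2 == 0:
        return (s3, 0, 0)                # one factor is zero
    e3 = e1 + e2 - BIAS                  # remove the doubly counted bias
    prod = (32 + m1) * (32 + m2)         # 1.m1 x 1.m2, scaled by 2**10
    if prod >= 2 << 10:                  # product in [2, 4): one-bit shift
        prod >>= 1
        e3 += 1
    m3 = (prod >> 5) - 32                # truncate to 5 bits, drop hidden one
    if e3 > E_MAX:
        return (s3, E_INF, 0)            # overflow
    if e3 < E_MIN:
        return (s3, 0, 0)                # underflow
    return (s3, e3, m3)

print(fp_mul((0, 31, 24), (0, 31, 24)))  # 1.75 x 1.75
```

Python integers are unbounded, so the two extra exponent bits mentioned above are not visible in this model; in hardware the intermediate value of e3 needs that headroom.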

Figure 2.27 shows the multiplication in the (1,6,5) floating-point format 
of the following values (see also row 1 in Fig. 2.27): 

1) $(-1) \times (-1) = 1.0_{10} = 1.00000_2 \times 2^{31-\text{bias}}$

2) $1.75 \times 1.75 = 3.0625_{10} = 11.0001_2 \times 2^{31-\text{bias}} = 1.10001_2 \times 2^{32-\text{bias}}$

3) exponent: $7 + 7 - \text{bias} = -17 < E_{\min}$ → underflow in multiplication

4) $0 \times \infty = \infty$ per definition (NaNs are not supported).

5) $-1.75 \times 0 = -0$

Rows 2 to 5 show the first floating-point number f1 and the three parts:
sign, exponent, and mantissa. Rows 6 to 9 show the same for the second
operand f2, and rows 10 to 13 the product f3 and the decomposition of the
three parts.

2.6.4 Floating-point Addition 

Floating-point addition is more complex than multiplication. Two numbers
in scientific format

$$f_3 = f_1 + f_2 = (a_1 2^{e_1}) \pm (a_2 2^{e_2})$$

can only be added if the exponents are the same, i.e., $e_1 = e_2$. Without loss
of generality we assume in the following that the second number has the
(absolute) smaller value. If this is not true, we just exchange the first and the
second number. The next step is now to “denormalize” the smaller number
by using the following identity:

$$a_2 2^{e_2} = \frac{a_2}{2^d}\, 2^{e_2 + d}.$$

If we select the normalization factor such that $e_2 + d = e_1$, i.e., $d = e_1 - e_2$, we
get

$$\frac{a_2}{2^d}\, 2^{e_2 + d} = \frac{a_2}{2^{e_1 - e_2}}\, 2^{e_1}.$$

Now both numbers have the same exponent and we can, depending on the
signs, add or subtract the first mantissa and the aligned second, according to

$$a_3 = a_1 \pm \frac{a_2}{2^{e_1 - e_2}}.$$

We need also to check if the second operand is zero. This is the case if $e_2 = 0$
or $d > M$, i.e., the shift operation reduces the second mantissa to zero. If the
second operand is zero the first (larger) operand is forwarded to the result
$f_3$.

The two aligned mantissas are added if the two floating-point operands
have the same sign, otherwise subtracted. The new mantissa needs to be
normalized to have the $1.m_3$ format, and the exponent, initially set to $e_3 = e_1$,
needs to be adjusted according to the normalization of the mantissa. We
need to determine the number of leading zeros including the first one and
perform a logical shift left (SLL). We also need to take into account if one of
the operands is a special number, or if over- or underflow occurs. If the first
operand is ∞ or the new computed exponent is larger than $E_{\max}$ the output
is set to ∞. This implies that ∞ − ∞ = ∞ since NaNs are not supported.
If the new computed exponent is smaller than $E_{\min}$, underflow has occurred
and the output is set to zero. Finally, we concatenate the sign, exponent, and
mantissa to the new floating-point number.
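
The alignment, add/subtract, and normalization steps can be sketched as follows. This Python model uses the same assumed constants as before (bias = 31, E_min = 1, E_max = 62, exponent code 63 for infinity, M = 5 mantissa bits) and truncates the aligned mantissa; the name `fp_add` follows the block name in Table 2.9 and the code is an illustration, not the book's implementation.

```python
BIAS, E_MIN, E_MAX, E_INF, M = 31, 1, 62, 63, 5   # assumed format constants

def fp_add(a, b):
    """Add two (sign, exponent, mantissa) triples in the (1,6,5) format."""
    if (b[1], b[2]) > (a[1], a[2]):      # make a the operand of larger magnitude
        a, b = b, a
    (s1, e1, m1), (s2, e2, m2) = a, b
    if e1 == E_INF:
        return (s1, E_INF, 0)            # infinity dominates (inf - inf = inf)
    d = e1 - e2
    if e2 == 0 or d > M:
        return a                         # second operand is or aligns to zero
    a1 = 32 + m1                         # 1.m1 scaled by 2**5
    a2 = (32 + m2) >> d                  # "denormalized" smaller operand
    a3 = a1 + a2 if s1 == s2 else a1 - a2
    if a3 == 0:
        return (0, 0, 0)
    e3 = e1
    while a3 >= 64:                      # mantissa sum in [2, 4): shift right
        a3 >>= 1
        e3 += 1
    while a3 < 32:                       # leading zeros: shift left
        a3 <<= 1
        e3 -= 1
    if e3 > E_MAX:
        return (s1, E_INF, 0)            # overflow
    if e3 < E_MIN:
        return (s1, 0, 0)                # underflow
    return (s1, e3, a3 - 32)             # drop the hidden one

print(fp_add((0, 34, 5), (1, 34, 10)))   # 9.25 + (-10.5)
```

The result sign is taken from the operand with the larger magnitude, which is why the operands are ordered first.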

Figure 2.28 shows the addition in the (1,6,5) floating-point format of the 
following values (see also row 1 in Fig. 2.28): 

1) $9.25 + (-10.5) = -1.25_{10} = -1.01000_2 \times 2^{31-\text{bias}}$

2) $1.0 + (-1.0) = 0$

3) $1.00111_2 \times 2^{2-\text{bias}} + (-1.00100_2 \times 2^{2-\text{bias}}) = 0.00011_2 \times 2^{2-\text{bias}} = 1.1_2 \times 2^{-2-\text{bias}}$;
   $-2 < E_{\min}$ → underflow

4) $1.01111_2 \times 2^{62-\text{bias}} + 1.11110_2 \times 2^{62-\text{bias}} = 11.01101_2 \times 2^{62-\text{bias}} \approx 1.1_2 \times 2^{63-\text{bias}}$;
   $63 > E_{\max}$ → overflow

5) $-\infty + 1 = -\infty$

Rows 2 to 5 show the first floating-point number f1 and the three parts:
sign, exponent, and mantissa. Rows 6 to 9 show the same for the second
operand f2, and rows 10 to 13 show the sum f3 and the decomposition in
the three parts, sign, exponent, and mantissa.

Fig. 2.28. Simulation results for additions with floating-point numbers in the
(1,6,5) format.

2.6.5 Floating-point Division 

In general, the division of two numbers in scientific format is accomplished 
by division of the mantissas and subtraction of the exponents, i.e., 

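$$f_1 / f_2 = (a_1 2^{e_1}) / (a_2 2^{e_2}) = (a_1 / a_2)\, 2^{e_1 - e_2}.$$

For our floating-point format with an implicit one and a biased exponent this
becomes

$$\begin{aligned}
f_1 / f_2 &= (-1)^{s_1} \left( 1.m_1\, 2^{e_1 - \text{bias}} \right) / \left[ (-1)^{s_2} \left( 1.m_2\, 2^{e_2 - \text{bias}} \right) \right] \\
&= (-1)^{(s_1 + s_2) \bmod 2}\, \underbrace{(1.m_1 / 1.m_2)}_{m_3}\, 2^{\overbrace{e_1 - e_2 + \text{bias}}^{e_3} - \text{bias}} \\
&= (-1)^{s_3}\, 1.m_3\, 2^{e_3 - \text{bias}}.
\end{aligned}$$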

We note that the exponent difference needs to be adjusted by the bias, since
the bias is no longer present after the subtraction of the exponents. The
sign of the quotient is the XOR or modulo-2 sum of the two sign bits of
the two operands. The division of the mantissas can be implemented with
any algorithm discussed in Sec. 2.5 (p. 63) or we can use the lpm_divide
component. Because the denominator and quotient have to be at least M+1 bits
wide, but numerator and quotient have the same bit width in the lpm_divide
component, we need to use a numerator and quotient with 2 × (M+1) bits.
Because the numerator and denominator are both in the range
$1 \le 1.m_{1,2} < 2$, we conclude that the quotient will be in the range
$0.5 < 1.m_3 < 2$. It follows that a normalization of only one bit (including
the exponent adjustment by 1) is required.

Fig. 2.29. Simulation results for division with floating-point numbers in the
(1,6,5) format.

We need also to take care of the special values. The result is ∞ if the
numerator is ∞, the denominator is zero, or we detect an overflow, i.e.,

$$e_1 - e_2 + \text{bias} = e_3 > E_{\max}.$$

Then we check for a zero quotient. The quotient is set to zero if the numerator
is zero, the denominator is ∞, or we detect an underflow, i.e.,

$$e_1 - e_2 + \text{bias} = e_3 < E_{\min}.$$

In all other cases the result is in the valid range that produces no special 
result. 

Finally, we build the new floating-point number by concatenation of the 
sign, exponent, and magnitude. 

Figure 2.29 shows the division in the (1,6,5) floating-point format of the 
following values (see also row 1 in Fig. 2.29): 

1) $(-1)/(-1) = 1.0_{10} = 1.00000_2 \times 2^{31-\text{bias}}$

2) $-10.5/9.25 = -1.135_{10} \approx -1.001_2 \times 2^{31-\text{bias}}$

3) $9.25/(-10.5) = -0.880952_{10} \approx -1.11_2 \times 2^{30-\text{bias}}$

4) exponent: $60 - 3 + \text{bias} = 88 > E_{\max}$ → overflow in division

5) exponent: $3 - 60 + \text{bias} = -26 < E_{\min}$ → underflow in division

6) $1.0/0 = \infty$

7) $0/(-1.0) = -0.0$

Rows 2 to 5 show the first floating-point number and the three parts: sign,
exponent, and mantissa. Rows 6 to 9 show the same for the second operand,
and rows 10 to 13 show the quotient and the decomposition in the three
parts.



2.6.6 Floating-point Reciprocal 



Although the reciprocal function of a floating-point number, i.e.,

$$1.0 / f = \frac{1.0}{(-1)^s\, 1.m\, 2^e} = (-1)^s\, 2^{-e} / 1.m$$



seems to be less frequently used than the other arithmetic functions, it is
nonetheless useful since it can also be used in combination with the multiplier
to build a floating-point divider, because

$$f_1 / f_2 = \frac{1.0}{f_2} \times f_1,$$

i.e., reciprocal of the denominator followed by multiplication is equivalent to
the division.

If the bit width of the mantissa is not too large, we may implement the
reciprocal of the mantissa via a look-up table implemented with a case
statement or with a memory block, i.e., an EAB. Because the mantissa is in
the range $1 \le 1.m < 2$, the reciprocal must be in the range $0.5 < 1/1.m \le 1$.
The mantissa normalization is therefore a one-bit shift for all values except
f = 1.0.

The following include file fptab5.mif was generated with the program
fpinv.exe (included on the CD-ROM under book2e/util) and shows the
first few values for a 5-bit reciprocal look-up table. The file has the following
contents:

    -- This is the floating-point 1/x table for 5 bit data
    -- automatically generated with fpinv.exe -- DO NOT EDIT!
    depth = 32;
    width = 5;
    address_radix = dec;
    data_radix = dec;
    content
    begin
    0 : 0;
    1 : 30; -- 30.060606
    2 : 28; -- 28.235294
    3 : 27; -- 26.514286
    4 : 25; -- 24.888889
    5 : 23; -- 23.351351
    6 : 22; -- 21.894737
    7 : 21; -- 20.512821
    8 : 19; -- 19.200000
    END;

Fig. 2.30. Simulation results for reciprocal with floating-point numbers in the
(1,6,5) format.

We also need to take care of the special values. The reciprocal of ∞ is
0, and the reciprocal of 0 is ∞. For all other values the new exponent $e_2$ is
computed with

$$e_2 = -(e_1 - \text{bias}) + \text{bias} = 2 \times \text{bias} - e_1.$$

Finally, we build the reciprocal floating-point number by the concatena- 
tion of the sign, exponent, and magnitude. 
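
The table entries and the exponent rule can be reproduced with a short model. The Python sketch below is not the fpinv.exe program; the entry formula (reciprocal of 1.m, shifted by one bit, hidden one removed, rounded to nearest) is inferred from the commented values in the listing, and the constants bias = 31, E_min = 1, and exponent code 63 for infinity are assumptions. The names `reciprocal_table` and `fp_rec` are chosen for this illustration.

```python
BIAS, E_MIN, E_INF = 31, 1, 63           # assumed format constants

def reciprocal_table(bits=5):
    """Mantissa look-up table for 1/1.m with `bits` mantissa bits."""
    size = 1 << bits
    table = [0]                           # 1/1.0 = 1.0 needs no shift
    for i in range(1, size):
        recip = size / (size + i)         # 1/1.m, in the range (0.5, 1)
        # One-bit shift, remove the hidden one, round to nearest.
        table.append(int(2 * size * recip - size + 0.5))
    return table

TABLE = reciprocal_table()

def fp_rec(s, e, m):
    """Reciprocal of a (sign, exponent, mantissa) triple in the (1,6,5) format."""
    if e == E_INF:
        return (s, 0, 0)                  # 1/inf = 0
    if e == 0:
        return (s, E_INF, 0)              # 1/0 = inf
    e2 = 2 * BIAS - e                     # exponent rule from the text
    if m != 0:
        e2 -= 1                           # one-bit mantissa normalization
    if e2 < E_MIN:
        return (s, 0, 0)                  # underflow
    return (s, e2, TABLE[m])

print(TABLE[:9])
print(fp_rec(0, 31, 8))                   # 1/1.25
```

The first nine table entries agree with the values shown in the listing of fptab5.mif.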

Figure 2.30 shows the reciprocal in the (1,6,5) floating-point format of the 
following values (see also row 1 in Fig. 2.30): 

1) $-1/2 = -0.5_{10} = -1.0_2 \times 2^{30-\text{bias}}$

2) $1/1.25_{10} = 0.8_{10} \approx (32 + 19)/64 = 1.10011_2 \times 2^{30-\text{bias}}$

3) $1/1.031 = 0.9697_{10} \approx (32 + 30)/64 = 1.11110_2 \times 2^{30-\text{bias}}$

4) $1.0/0 = \infty$

5) $1/\infty = 0.0$

For the first three values the entries (without leading 1) correspond to the
MIF file from above for the address lines 0, 8, and 1, respectively. Rows 2 to 5
show the input floating-point number f1 and the three parts: sign, exponent,
and mantissa. Rows 6 to 9 show the reciprocal f2 and the decomposition in
the three parts.



2.6.7 Floating-point Synthesis Results 

In order to measure the Registered Performance, registers were added to
the input and output ports, but no pipelining inside the block has been used.
Table 2.9 shows the synthesis results for all six basic building blocks. As
expected the floating-point adder is more complex than the multiplier or the
divider. The conversion blocks also use substantial resources. Besides the
listed LCs, the reciprocal block also uses one embedded array block (EAB), or,
more specifically, 160 bits of an EAB.



Table 2.9. Synthesis results for floating-point design using the (1,6,5) data format.

    Block      Use I/O register      No use of I/O register
               MHz       LCs         MHz       LCs
    fix2fp     11.8      166         12.51     175
    fp2fix     13.51     234         13.15     245
    fp_mul     18.83     130         19.08     153
    fp_add      7.28     198          7.88     222
    fp_div      9.14     150          9.91     167
    fp_rec     26.66      42         30.76      53

These blocks are available from several “intellectual property” providers, 
or through special request via e-mail to Uwe.Meyer-Baese@ieee.org. 



2.7 Multiply-Accumulator (MAC) and Sum of Product (SOP)

DSP algorithms are known to be multiply-accumulate (MAC) intensive. To 
illustrate, consider the linear convolution sum given by 

$$y[n] = f[n] * x[n] = \sum_{k=0}^{L-1} f[k]\, x[n-k] \qquad (2.42)$$

requiring L consecutive multiplications and L − 1 addition operations per
sample y[n] to compute the sum of products (SOPs). This suggests that
N × N-bit multipliers need to be fused together with an accumulator. A
full-precision N × N-bit product is 2N bits wide. If both operands are (symmetric)
signed numbers, the product will only have 2N − 1 significant bits, i.e., two
sign bits. The accumulator, in order to maintain sufficient dynamic range,
is often designed to be an extra K bits in width, as demonstrated in the
following example.

Example 2.22: The Analog Devices PDSP family ADSP21xx contains a 16 × 16
array multiplier and an accumulator with an extra 8 bits (for a total accumulator
width of 32 + 8 = 40 bits). With these eight extra bits, at least $2^8$ accumulations are
possible without sacrificing the output. If both operands are signed, $2^9$ accumulations
can be performed. In order to produce the desired output format, such modern
PDSPs also include a barrelshifter, which allows the desired adjustment within one
clock cycle. | 2.22 |

This overflow consideration in fixed-point PDSPs is important to mainstream
digital signal processing, which requires that DSP objects be computed
in real time without unexpected interruptions. Recall that checking
and servicing accumulator overflow interrupts the data flow and carries a
significant temporal liability. By choosing the number of guard bits correctly,
the liability can be eliminated.
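
The guard-bit argument of Example 2.22 can be verified with exact integer arithmetic. The following Python sketch is a worst-case check only; the function name `accumulations_fit` is hypothetical, and the signed case assumes symmetric operands, as in the text.

```python
def accumulations_fit(n_bits, guard_bits, count, signed):
    """Check whether `count` worst-case n_bits x n_bits products fit into an
    accumulator of 2 * n_bits + guard_bits bits without overflow."""
    acc_bits = 2 * n_bits + guard_bits
    if signed:
        prod_max = (2 ** (n_bits - 1) - 1) ** 2   # symmetric signed operands
        acc_max = 2 ** (acc_bits - 1) - 1         # largest positive accumulator value
    else:
        prod_max = (2 ** n_bits - 1) ** 2         # largest unsigned product
        acc_max = 2 ** acc_bits - 1
    return count * prod_max <= acc_max

print(accumulations_fit(16, 8, 2 ** 8, signed=False))
print(accumulations_fit(16, 8, 2 ** 9, signed=True))
```

Both checks correspond to the 40-bit accumulator of the example; one additional accumulation beyond the stated counts can already overflow in the worst case.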

An alternative approach to the MAC of a conventional PDSP for com- 
puting a sum of product will be discussed in the next section. 



2.7.1 Distributed Arithmetic Fundamentals 

Distributed arithmetic (DA) is an important FPGA technology. It is
extensively used in computing the sum of products

$$y = \langle c, x \rangle = \sum_{n=0}^{N-1} c[n]\, x[n]. \qquad (2.43)$$

Besides convolution, correlation, DFT computation and the RNS inverse 
mapping discussed earlier can also be formulated as such a “sum of prod- 
ucts” (SOPs). Completing a filter cycle, when using a conventional arith- 
metic unit, would take approximately N MAC cycles. This amount can be 
shortened with pipelining but can, nevertheless, be prohibitively long. This 
is a fundamental problem when general-purpose multipliers are used. 

In many DSP applications, a general-purpose multiplication is technically 
not required. If the filter coefficients c[n] are known a priori, then technically 
the partial product term c[n]x[n] becomes a multiplication with a constant 
(i.e., scaling). This is an important difference and is a prerequisite for a DA 
design. 

The first discussion of DA can be traced to a 1973 paper by Croisier [48] 
and DA was popularized by Peled and Liu [49]. Yiu [50] extended DA to 
signed numbers, and Kammeyer [51] and Taylor [52] studied quantization 
effects in DA systems. DA tutorials are available from White [53] and Kam- 
meyer [54]. DA also is addressed in textbooks [55, 56]. To understand the 
DA design paradigm, consider the “sum of products” inner product shown 
below: 



$$\begin{aligned}
y = \langle c, x \rangle &= \sum_{n=0}^{N-1} c[n] \times x[n] \\
&= c[0]x[0] + c[1]x[1] + \ldots + c[N-1]x[N-1]. \qquad (2.44)
\end{aligned}$$

Assume further that the coefficients c[n] are known constants and x[n] is
a variable. An unsigned DA system assumes that the variable x[n] is
represented by:

$$x[n] = \sum_{b=0}^{B-1} x_b[n] \times 2^b \quad \text{with} \quad x_b[n] \in [0, 1], \qquad (2.45)$$

where $x_b[n]$ denotes the b-th bit of x[n], i.e., the n-th sample of x. The inner
product y can, therefore, be represented as:

$$y = \sum_{n=0}^{N-1} c[n] \times \sum_{b=0}^{B-1} x_b[n] \times 2^b. \qquad (2.46)$$

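Redistributing the order of summation (thus the name “distributed
arithmetic”) results in:

$$\begin{aligned}
y &= c[0] \left( x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \ldots + x_0[0]\, 2^0 \right) \\
&\quad + c[1] \left( x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \ldots + x_0[1]\, 2^0 \right) \\
&\quad \vdots \\
&\quad + c[N-1] \left( x_{B-1}[N-1]\, 2^{B-1} + \ldots + x_0[N-1]\, 2^0 \right) \\
&= \left( c[0]x_{B-1}[0] + c[1]x_{B-1}[1] + \ldots + c[N-1]x_{B-1}[N-1] \right) 2^{B-1} \\
&\quad + \left( c[0]x_{B-2}[0] + c[1]x_{B-2}[1] + \ldots + c[N-1]x_{B-2}[N-1] \right) 2^{B-2} \\
&\quad \vdots \\
&\quad + \left( c[0]x_0[0] + c[1]x_0[1] + \ldots + c[N-1]x_0[N-1] \right) 2^0,
\end{aligned}$$

or in more compact form

$$y = \sum_{b=0}^{B-1} 2^b \times \sum_{n=0}^{N-1} f(c[n], x_b[n]). \qquad (2.47)$$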

6=0 n =0 

Implementation of the function $f(c[n], x_b[n]) = c[n] \times x_b[n]$ requires special attention. The preferred implementation method is to realize the mapping $f(c[n], x_b[n])$ using one LUT. That is, a $2^N$-word LUT is preprogrammed to accept an $N$-bit input vector $x_b = [x_b[0], x_b[1], \ldots, x_b[N-1]]$, and output $f(c[n], x_b[n])$ summed over all $n$. The individual mappings $f(c[n], x_b[n])$ are weighted by the appropriate power-of-two factor and accumulated. The accumulation can be efficiently implemented using a shift-adder as shown in Fig. 2.31b. After $B$ look-up cycles, one per bit of the input word, the inner product $y$ is computed.



Fig. 2.31. Conventional PDSP and shift-adder DA architecture.



Example 2.23: Unsigned DA Convolution 

A third-order inner product is defined by the inner product equation $y = \langle c, x \rangle = \sum_{n=0}^{2} c[n]x[n]$. Assume that the 3-bit coefficients have the values c[0] = 2, c[1] = 3, and c[2] = 1. The resulting LUT, which implements the sum over n of f(c[n], x_b[n]), is defined below:

x_b[2]  x_b[1]  x_b[0]    Sum of f(c[n], x_b[n])
  0       0       0       1x0 + 3x0 + 2x0 = 0_10 = 000_2
  0       0       1       1x0 + 3x0 + 2x1 = 2_10 = 010_2
  0       1       0       1x0 + 3x1 + 2x0 = 3_10 = 011_2
  0       1       1       1x0 + 3x1 + 2x1 = 5_10 = 101_2
  1       0       0       1x1 + 3x0 + 2x0 = 1_10 = 001_2
  1       0       1       1x1 + 3x0 + 2x1 = 3_10 = 011_2
  1       1       0       1x1 + 3x1 + 2x0 = 4_10 = 100_2
  1       1       1       1x1 + 3x1 + 2x1 = 6_10 = 110_2

The inner product, with respect to x[n] = {x[0] = 1_10 = 001_2, x[1] = 3_10 = 011_2, x[2] = 7_10 = 111_2}, is obtained as follows:

Step t   x_t[2]  x_t[1]  x_t[0]    f[t] x 2^t + ACC[t-1] = ACC[t]
  0        1       1       1       6 x 2^0 +  0 =  6
  1        1       1       0       4 x 2^1 +  6 = 14
  2        1       0       0       1 x 2^2 + 14 = 18

As a numerical check, note that

y = <c, x> = c[0]x[0] + c[1]x[1] + c[2]x[2]
  = 2 x 1 + 3 x 3 + 1 x 7 = 18.
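The table look-up and shift-accumulate steps of Example 2.23 can be mirrored in a few lines of software. The following Python sketch is only an illustration, not the book's hardware description; the function names `build_da_lut` and `da_unsigned` are hypothetical, and the sketch assumes unsigned samples and that address bit n of the LUT corresponds to x_b[n].

```python
def build_da_lut(coeffs):
    """Return the 2**N-word table; address bit n selects coefficient c[n]."""
    n_taps = len(coeffs)
    return [
        sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
        for addr in range(1 << n_taps)
    ]


def da_unsigned(coeffs, samples, bits):
    """Inner product of constant coeffs with unsigned samples of width `bits`."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Form the LUT address from bit b of every sample.
        addr = sum(((x >> b) & 1) << n for n, x in enumerate(samples))
        acc += lut[addr] << b  # weight the table word by 2**b
    return acc


lut = build_da_lut([2, 3, 1])
y = da_unsigned([2, 3, 1], [1, 3, 7], bits=3)
print(lut, y)
```

With the coefficients and data of Example 2.23, the table entries and the accumulated result can be compared line by line with the two tables above.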



For a hardware implementation, instead of shifting each intermediate 
value by b (which will demand an expensive barrelshifter) it is more ap- 
propriate to shift the accumulator content itself in each iteration one bit to 
the right. It is easy to verify that this will give the same results. 
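One way to check the claim that shifting the accumulator gives the same result is a small software model. The Python sketch below is an assumption-laden illustration: the name `da_shift_right` is hypothetical, the table is the one from Example 2.23, and the accumulator is assumed wide enough that no bits fall off during the right shifts.

```python
LUT = [0, 2, 3, 5, 1, 3, 4, 6]  # table of Example 2.23, c = (2, 3, 1)


def da_shift_right(lut, samples, bits):
    """DA accumulation with a fixed add position and a 1-bit right shift."""
    acc = 0
    for b in range(bits):
        addr = sum(((x >> b) & 1) << n for n, x in enumerate(samples))
        # Add the table word at the fixed weight 2**bits, then shift right once.
        acc = (acc + (lut[addr] << bits)) >> 1
    return acc


y_fixed = da_shift_right(LUT, [1, 3, 7], bits=3)
print(y_fixed)
```

After `bits` iterations the word added in step b has been shifted right `bits - b` times, so it ends up with weight 2**b, which is the weight used in (2.47).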

The bandwidth of an $N$th-order $B$-bit linear convolution, using general-purpose MACs and DA hardware, can be compared. Figure 2.31 shows the architectures of a conventional PDSP and the same realization using distributed arithmetic.

Assume that a LUT and a general-purpose multiplier have the same delay $\tau = \tau(\text{LUT}) = \tau(\text{MUL})$. The computational latencies are then $B\tau(\text{LUT})$ for DA and $N\tau(\text{MUL})$ for the PDSP. In the case of small bit width $B$, the DA design can therefore be significantly faster than a MAC-based design. In Chap. 3, comparisons will be made for specific filter design examples.



2.7.2 Signed DA Systems 

In the following, we wish to discuss how (2.44) should be modified, in order to process a signed two’s complement number. In two’s complement, the MSB is used to distinguish between positive and negative numbers. For instance, from Table 2.1 (p. 35) we see that decimal $-3$ is coded as $101_2 = -4 + 0 + 1 = -3_{10}$. We use, therefore, the following $(B+1)$-bit representation

$$x[n] = -2^B \times x_B[n] + \sum_{b=0}^{B-1} x_b[n] \times 2^b. \tag{2.48}$$

Combining this with (2.46), the outcome $y$ is defined by:

$$y = -2^B \times \sum_{n=0}^{N-1} f(c[n], x_B[n]) + \sum_{b=0}^{B-1} 2^b \times \sum_{n=0}^{N-1} f(c[n], x_b[n]). \tag{2.49}$$

To achieve the signed DA system we therefore have two choices to modify the unsigned DA system. They are

• An accumulator with add/subtract control

• Using a ROM with one additional input

Most often the switchable accumulator is preferred, because the additional input bit in the table requires a table with twice as many words. The following example demonstrates the processing steps for the add/sub switch design.

Example 2.24: Signed DA Inner Product 

Consider again a third-order inner product defined by the convolution sum $y = \langle c, x \rangle = \sum_{n=0}^{2} c[n]x[n]$. Assume that the data is given in a 4-bit two’s complement encoding and that the coefficients are c[0] = -2, c[1] = 3, and c[2] = 1. The corresponding LUT table is given below:

x_b[2]  x_b[1]  x_b[0]    Sum of f(c[n], x_b[n])
  0       0       0       1x0 + 3x0 - 2x0 =  0_10
  0       0       1       1x0 + 3x0 - 2x1 = -2_10
  0       1       0       1x0 + 3x1 - 2x0 =  3_10
  0       1       1       1x0 + 3x1 - 2x1 =  1_10
  1       0       0       1x1 + 3x0 - 2x0 =  1_10
  1       0       1       1x1 + 3x0 - 2x1 = -1_10
  1       1       0       1x1 + 3x1 - 2x0 =  4_10
  1       1       1       1x1 + 3x1 - 2x1 =  2_10

The values of x[k] are x[0] = 1_10 = 0001_2C, x[1] = -3_10 = 1101_2C, and x[2] = 7_10 = 0111_2C. The output at sample index k, namely y, is defined as follows:

Step t   x_t[2]  x_t[1]  x_t[0]    f[t] x 2^t + Y[t-1] = Y[t]
  0        1       1       1       2 x 2^0 + 0 =  2
  1        1       0       0       1 x 2^1 + 2 =  4
  2        1       1       0       4 x 2^2 + 4 = 20

Step t   x_t[2]  x_t[1]  x_t[0]    f[t] x (-2^t) + Y[t-1] = Y[t]
  3        0       1       0       -3 x 2^3 + 20 = -4

A numerical check results in c[0]x[0] + c[1]x[1] + c[2]x[2] = -2 x 1 + 3 x (-3) + 1 x 7 = -4.
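The add/subtract accumulator variant of Example 2.24 can be sketched in software as follows. This is a minimal illustration under stated assumptions: `da_signed` is a hypothetical name, `bits` denotes the total two's complement width (B + 1), and the sign-bit step is modeled simply by subtracting instead of adding the weighted table word.

```python
def da_signed(coeffs, samples, bits):
    """Signed DA inner product; `bits` is the two's complement width B + 1."""
    n_taps = len(coeffs)
    lut = [
        sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
        for addr in range(1 << n_taps)
    ]
    mask = (1 << bits) - 1  # view negative samples as two's complement words
    acc = 0
    for b in range(bits):
        addr = sum((((x & mask) >> b) & 1) << n for n, x in enumerate(samples))
        term = lut[addr] * (1 << b)
        # The MSB carries weight -2**B, so the last step subtracts.
        acc = acc - term if b == bits - 1 else acc + term
    return acc


y_signed = da_signed([-2, 3, 1], [1, -3, 7], bits=4)
print(y_signed)
```

The intermediate accumulator values follow the steps t = 0, ..., 3 listed in the example, with the subtraction applied only in the final (sign-bit) step.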



2.7.3 Modified DA Solutions 

In the following we wish to discuss two interesting modifications to the ba- 
sic DA concept, where the first variation reduces the size, and the second 
increases the speed. 

If the number of coefficients $N$ is too large to implement the full word with a single LUT (recall that input LUT bit width = number of coefficients), then we can use partial tables and add the results. If we also add pipeline registers, this modification will not reduce the speed, but can dramatically reduce the size of the design, because the size of a LUT grows exponentially with the address space, i.e., the number of input coefficients $N$. Suppose the length $LN$ inner product

$$y = \langle c, x \rangle = \sum_{n=0}^{LN-1} c[n]x[n] \tag{2.50}$$

is to be implemented using a DA architecture. The sum can be partitioned into $L$ independent $N$th-order parallel DA LUTs resulting in

$$y = \langle c, x \rangle = \sum_{l=0}^{L-1} \sum_{n=0}^{N-1} c[Nl + n]\,x[Nl + n]. \tag{2.51}$$

This is shown in Fig. 2.32 for a realization of a $4N$ DA design requiring three additional post adders. The size of the table is reduced from one $2^{4N} \times B$ LUT to four $2^N \times B$ tables.
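The saving in table size can be made concrete with a short sketch of (2.51). The Python code below is illustrative only: `da_partitioned` and its parameter `taps_per_lut` (playing the role of N) are hypothetical names, the data are assumed unsigned, and the post adders are modeled by a plain running sum.

```python
def da_partitioned(coeffs, samples, bits, taps_per_lut):
    """Split a long inner product into blocks, one small LUT per block."""
    total = 0
    table_words = 0
    for start in range(0, len(coeffs), taps_per_lut):
        c_part = coeffs[start:start + taps_per_lut]
        x_part = samples[start:start + taps_per_lut]
        lut = [
            sum(c for n, c in enumerate(c_part) if (addr >> n) & 1)
            for addr in range(1 << len(c_part))
        ]
        table_words += len(lut)
        for b in range(bits):
            addr = sum(((x >> b) & 1) << n for n, x in enumerate(x_part))
            total += lut[addr] << b  # post adders combine the partial results
    return total, table_words


coeffs = [2, 3, 1, 5, 4, 7, 6, 8]
samples = [1, 3, 7, 0, 2, 5, 6, 4]
y_part, words = da_partitioned(coeffs, samples, bits=3, taps_per_lut=2)
print(y_part, words, 1 << len(coeffs))
```

For this eight-tap example with four two-input tables, the printed word counts show the gap between the partitioned tables and a single table addressed by all eight inputs.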

Another variation of the DA architecture increases speed at the expense of additional LUTs, registers, and adders. A basic DA architecture, for a length-$N$ sum-of-product computation, accepts one bit from each of $N$ words. If two bits per word are accepted, then the computational speed can be essentially doubled. The maximum speed can be achieved with the fully pipelined word-parallel architecture shown in Fig. 2.33. Here, a new result of a length-four sum-of-product is computed for 4-bit signed coefficients at each LUT cycle. For maximum speed, we have to provide a separate ROM (with identical content) for each bit vector $x_b[n]$. But the maximum speed can become expensive: If we double the input bit width, we need twice as many LUTs, adders, and registers. If the number of coefficients $N$ is limited to four or eight, this modification gives attractive performance, essentially outperforming all commercially available programmable signal processors, as we will see in Chap. 3.

Fig. 2.33. Higher-order distributed arithmetic optimized for speed.



2.8 Computation of Special Functions Using CORDIC 



If a digital signal processing algorithm is implemented with FPGAs and the algorithm uses a nontrivial (transcendental) algebraic function, like $\sqrt{x}$ or $\arctan(y/x)$, we can always use the Taylor series to approximate this function, i.e.,

$$f(x) = \sum_{k=0}^{K} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k,$$

and the problem is reduced to a sequence of multiply and add operations. A more efficient, alternative approach, based on the Coordinate Rotation Digital Computer (CORDIC) algorithm can also be considered. The CORDIC algorithm is found in numerous applications, such as pocket calculators [57], and in mainstream DSP objects, such as adaptive filters, FFTs, DCTs [58], demodulators [59], and neural networks [36]. The basic CORDIC algorithm can be found in two classic papers by Volder [60] and Walther [61]. Some theoretical extensions have been made, such as the extension of range in the hyperbolic mode, or the quantization error analysis by Hu et al. [62], and Meyer-Bäse et al. [59]. VLSI implementations have been discussed in Ph.D. theses, such as those by Timmermann [63] and Hahn [64]. The first FPGA implementations were investigated by Meyer-Bäse et al. [4, 59]. The realization of the CORDIC algorithm in distributed arithmetic was investigated by Ma [65]. A very detailed overview, including details of several applications, was provided by Hu [58] in a 1992 IEEE Signal Processing Magazine review paper.

Table 2.10. CORDIC algorithm modes.

Mode                 Angle theta_k     Shift sequence       Radius factor
circular m = 1       tan^-1(2^-k)      0, 1, 2, ...         K_1 = 1.65
linear m = 0         2^-k              1, 2, ...            K_0 = 1.0
hyperbolic m = -1    tanh^-1(2^-k)     1, 2, 3, 4, 4, ...   K_-1 = 0.80

The original CORDIC algorithm by Volder [60] computes a multiplier-free coordinate conversion between rectangular $(x, y)$ and polar $(R, \theta)$ coordinates. Walther [61] generalized the CORDIC algorithm to include circular ($m = 1$), linear ($m = 0$), and hyperbolic ($m = -1$) transforms. For each mode, two rotation directions are identified. For vectoring, a vector with starting coordinates $(X_0, Y_0)$ is rotated in such a way that the vector finally lies on the abscissa (i.e., $x$ axis) by iteratively converging $Y_K$ to zero. For rotation, a vector with a starting coordinate $(X_0, Y_0)$ is rotated by an angle $\theta_0$ in such a way that the final value of the angle register, denoted $Z$, converges to zero. The angle $\theta_k$ is chosen so that each iteration can be performed with an addition and a binary shift. Table 2.10 shows, in the second column, the choice for the rotation angle for the three modes $m = 1, 0$, and $-1$.

Now we can formally define the CORDIC algorithm as follows:

Algorithm 2.25: CORDIC Algorithm

At each iteration, the CORDIC algorithm implements the mapping:

$$\begin{bmatrix} X_{k+1} \\ Y_{k+1} \end{bmatrix} = \begin{bmatrix} 1 & m\delta_k 2^{-k} \\ -\delta_k 2^{-k} & 1 \end{bmatrix} \begin{bmatrix} X_k \\ Y_k \end{bmatrix} \tag{2.53}$$

$$Z_{k+1} = Z_k + \delta_k \theta_k,$$

where the angle $\theta_k$ is given in Table 2.10, $\delta_k = \pm 1$, and the two rotation directions are $Z_K \to 0$ and $Y_K \to 0$.
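A floating-point model helps to see how the two rotation directions behave in the circular mode (m = 1). The Python sketch below is not the book's implementation: the names `cordic_circular` and `circular_gain` are hypothetical, and the sign convention is an assumption taken from the VHDL listing later in this section (for Y > 0 the stage adds to X, subtracts from Y, and adds the angle to Z).

```python
import math


def circular_gain(iterations):
    """Radius factor accumulated over the given number of circular iterations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -k) for k in range(iterations))


def cordic_circular(x, y, z, iterations, vectoring):
    """Circular CORDIC (m = 1) with shifts k = 0, 1, 2, ..."""
    for k in range(iterations):
        if vectoring:
            delta = 1 if y >= 0 else -1   # drive Y toward zero
        else:
            delta = -1 if z >= 0 else 1   # drive Z toward zero
        scale = 2.0 ** -k
        x, y, z = (x + delta * scale * y,
                   y - delta * scale * x,
                   z + delta * math.atan(scale))
    return x, y, z


K1 = circular_gain(24)
xv, yv, zv = cordic_circular(3.0, 4.0, 0.0, 24, vectoring=True)
xr, yr, zr = cordic_circular(1.0, 0.0, math.pi / 6, 24, vectoring=False)
print(xv / K1, zv, xr / K1, yr / K1)
```

In vectoring mode the X register approaches the scaled radius and Z the angle of the input vector; in rotation mode X and Y approach the scaled cosine and sine of the initial Z. The results are divided by the radius factor only for comparison; a hardware design would typically absorb that factor elsewhere.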



This means that six operational modes exist, and they are summarized in Table 2.11. A consequence is that nearly all transcendental functions can be computed with the CORDIC algorithm. With a proper choice of the initial values, the functions $X \times Y$, $Y/X$, $\sin(Z)$, $\cos(Z)$, $\tan^{-1}(Y)$, $\sinh(Z)$, $\cosh(Z)$, and $\tanh(Z)$ can directly be computed. Additional functions may be generated by choosing appropriate initialization, sometimes combined with multiple modes of operation, as shown in the following listing:

tan(Z) = sin(Z)/cos(Z)          Modes: m = 1, 0
tanh(Z) = sinh(Z)/cosh(Z)       Modes: m = -1, 0
exp(Z) = sinh(Z) + cosh(Z)      Modes: m = -1; x = y = 1
log_e(W) = 2 tanh^-1(Y/X)       Modes: m = -1, with X = W + 1, Y = W - 1
sqrt(W) = sqrt(X^2 - Y^2)       Modes: m = -1, with X = W + 1/4, Y = W - 1/4
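The initializations in the listing rest on simple identities that can be checked numerically. In the Python sketch below, library functions stand in for the CORDIC outputs, so it verifies only the identities, not a CORDIC datapath. The helper names are hypothetical, and the offsets of 1/4 in the square-root case are an assumption chosen so that X^2 - Y^2 = W.

```python
import math


def log_via_atanh(w):
    """ln(W) from the hyperbolic vectoring output 2 * atanh(Y/X)."""
    x, y = w + 1.0, w - 1.0
    return 2.0 * math.atanh(y / x)


def sqrt_via_hyperbolic_norm(w):
    """sqrt(W) from sqrt(X^2 - Y^2) with X = W + 1/4 and Y = W - 1/4."""
    x, y = w + 0.25, w - 0.25
    return math.sqrt(x * x - y * y)


def exp_via_sinh_cosh(z):
    """exp(Z) as the sum of the two hyperbolic rotation outputs."""
    return math.sinh(z) + math.cosh(z)


print(log_via_atanh(5.0), sqrt_via_hyperbolic_norm(2.0), exp_via_sinh_cosh(0.5))
```

In an actual CORDIC realization the hyperbolic radius factor and the limited convergence range of the hyperbolic mode would also have to be taken into account.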



A careful analysis of (2.53) reveals that the iteration vectors only approach the curves shown in Fig. 2.34a. The length of the vectors changes with each iteration, as shown in Fig. 2.34b. This change in length does not depend on the starting angle and after K iterations the same change (called radius factor) always occurs. In the last column of Table 2.10 these radius factors are shown. To ensure that the CORDIC algorithm converges, the sum of all remaining rotation angles must be larger than the actual rotation angle. This is the case for linear and circular transforms. For the hyperbolic mode, all iterations of the form $n_{k+1} = 3n_k + 1$ have to be repeated. These are the iterations 4, 13, 40, 121, ....

Table 2.11. Modes m of operation for the CORDIC algorithm.

m = 1
  Z_K -> 0:  X_K = K_1 (X_0 cos(Z_0) - Y_0 sin(Z_0))
             Y_K = K_1 (Y_0 cos(Z_0) + X_0 sin(Z_0))
  Y_K -> 0:  X_K = K_1 sqrt(X_0^2 + Y_0^2)
             Z_K = Z_0 + arctan(Y_0/X_0)

m = 0
  Z_K -> 0:  X_K = X_0
             Y_K = Y_0 + X_0 x Z_0
  Y_K -> 0:  X_K = X_0
             Z_K = Z_0 + Y_0/X_0

m = -1
  Z_K -> 0:  X_K = K_-1 (X_0 cosh(Z_0) + Y_0 sinh(Z_0))
             Y_K = K_-1 (Y_0 cosh(Z_0) + X_0 sinh(Z_0))
  Y_K -> 0:  X_K = K_-1 sqrt(X_0^2 - Y_0^2)
             Z_K = Z_0 + tanh^-1(Y_0/X_0)

Fig. 2.34. CORDIC. (a) Modes. (b) Example of circular vectoring.
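The repeated hyperbolic iterations and the radius factors can be reproduced with a short script. The Python sketch below is illustrative; the function names are hypothetical, and the radius factors are modeled as products of sqrt(1 + m 2^(-2k)) over the shift sequence, which is an assumption consistent with (2.53) rather than a formula quoted from the text.

```python
import math


def hyperbolic_shift_sequence(count):
    """Shifts 1, 2, 3, 4, 4, 5, ... with repeats at 4, 13, 40, 121, ..."""
    shifts, k, repeat = [], 1, 4
    while len(shifts) < count:
        shifts.append(k)
        if k == repeat:            # this iteration has to be executed twice
            shifts.append(k)
            repeat = 3 * repeat + 1
        k += 1
    return shifts[:count]


def radius_factor(m, shifts):
    """Product of the per-iteration length changes for mode m."""
    return math.prod(math.sqrt(1.0 + m * 4.0 ** -k) for k in shifts)


seq = hyperbolic_shift_sequence(16)
K_circ = radius_factor(1, range(0, 24))
K_hyp = radius_factor(-1, hyperbolic_shift_sequence(24))
print(seq, K_circ, K_hyp)
```

Under these assumptions the circular product settles near the 1.65 listed in Table 2.10, while the hyperbolic product comes out at roughly 0.83, slightly above the 0.80 given in the table.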

Output precision can be estimated using a procedure developed by Hu [66] and illustrated in Fig. 2.35. The graph shows the effective bit precision for the circular mode, depending on the X, Y path width and the number of iterations. If b bits is the desired output precision, the “rule of thumb” suggests that the X, Y path should have log_2(b) additional guard bits. From Fig. 2.36, it can also be seen that the bit width of the Z path should have the same precision as that for X and Y.

In contrast to the circular CORDIC algorithm, the effective resolution of a hyperbolic CORDIC cannot be computed analytically because the precision depends on the angular values of z(k) at iteration k. Hyperbolic precision can, however, be estimated using simulation. Figure 2.37 shows the minimum accuracy estimate computed over 1000 test values for each combination of bit width and number of iterations. The 3D representation shows the number of iterations, the bit width of the X/Y path, and the resulting minimum precision of the result in terms of effective bits. The contour lines allow an exchange between the number of iterations and the bit width. For example, to achieve 10-bit precision, one can use a 21-bit X/Y path and 18 iterations, or 14 iterations at 24 bits.

Fig. 2.35. Effective bits in circular mode.

Fig. 2.36. Resolution of phase for circular mode (horizontal axis: Z register width).

2.8.1 CORDIC Architectures 

Two basic structures are used to implement a CORDIC architecture: the 
more compact state machine or the high-speed, fully pipelined processor. 

If computation time is not critical, then a state machine as shown in 
Fig. 2.38 is applicable. In each cycle, exactly one iteration of (2.53) will be 
computed. The most complex part of this design is the two barrelshifters. The 
two barrelshifters can be replaced by a single barrelshifter, using a multiplexer 
as shown in Fig. 2.39, or a serial (right, or right/left) shifter. Table 2.12 
compares different design options for a 13-bit implementation using Xilinx 
XC3K FPGAs. 

If high speed is needed, a fully pipelined version of the design shown 
in Fig. 2.40 can be used. Figure 2.40 shows eight iterations of a circular 
CORDIC. After an initial delay of K cycles, a new output value becomes 
available after each cycle. As with array multipliers, CORDIC implementa- 
tions have a quadratic growth in LE complexity as the bit width increases 
(see Table 2.10). 

Table 2.12. Effort estimation (Xilinx XC3K) for a CORDIC machine with 13 bits plus sign for X/Y path. (Abbreviations: Ac=accumulator; BS=barrelshifter; RS=serial right shifter; LRS=serial left/right shifter)

Structure   Registers  Multiplexer  Adder   Shifter   Sum LE  Cycle
2BS+2Ac     2x7        0            2x14    2x19.5    81      12
2RS+2Ac     2x7        0            2x14    2x6.5     55      46
2LRS+2Ac    2x7        0            2x14    2x8       58      39
1BS+2Ac     7          3x7          2x14    19.5      75.5    20
1RS+2Ac     7          3x7          2x14    6.5       62.5    56
1LRS+2Ac    7          3x7          2x14    8         64      74
1BS+1Ac     3x7        2x7          14      19.5      68.5    20
1RS+1Ac     3x7        2x7          14      6.5       55.5    92
1LRS+1Ac    3x7        2x7          14      8         57      74

Fig. 2.39. CORDIC machine with reduced complexity.

The following example shows the first four steps of a circular-vectoring fully pipelined design.



45°, arctan(2^-1) = 26.5°, and arctan(2^-2) = 14°. The VHDL code for 8-bit data can be implemented as follows (the equivalent Verilog code cordic.v for this example can be found in Appendix A on page 449):

PACKAGE eight_bit_int IS          -- User-defined types
  SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
  TYPE ARRAY_BYTE IS ARRAY (0 TO 3) OF BYTE;
END eight_bit_int;

LIBRARY work;
USE work.eight_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY cordic IS                          ------> Interface
  PORT (clk          : IN  STD_LOGIC;
        x_in , y_in  : IN  BYTE;
        r, phi, eps  : OUT BYTE);
END cordic;

ARCHITECTURE flex OF cordic IS
  SIGNAL x, y, z : ARRAY_BYTE;            -- Array of Bytes
BEGIN

  PROCESS                                 ------> Behavioral Style
  BEGIN
    WAIT UNTIL clk = '1';   -- Compute last value first in
    r <= x(3);              -- sequential VHDL statements !!
    phi <= z(3);
    eps <= y(3);

    IF y(2) > 0 THEN        -- Rotate 14 degrees
      x(3) <= x(2) + y(2) / 4;
      y(3) <= y(2) - x(2) / 4;
      z(3) <= z(2) + 14;
    ELSE
      x(3) <= x(2) - y(2) / 4;
      y(3) <= y(2) + x(2) / 4;
      z(3) <= z(2) - 14;
    END IF;

    IF y(1) > 0 THEN        -- Rotate 26 degrees
      x(2) <= x(1) + y(1) / 2;
      y(2) <= y(1) - x(1) / 2;
      z(2) <= z(1) + 26;
    ELSE
      x(2) <= x(1) - y(1) / 2;
      y(2) <= y(1) + x(1) / 2;
      z(2) <= z(1) - 26;
    END IF;

    IF y(0) > 0 THEN        -- Rotate 45 degrees
      x(1) <= x(0) + y(0);
      y(1) <= y(0) - x(0);
      z(1) <= z(0) + 45;
    ELSE
      x(1) <= x(0) - y(0);
      y(1) <= y(0) + x(0);
      z(1) <= z(0) - 45;
    END IF;

    -- Test for x_in < 0 rotate 0, +90, or -90 degrees
    IF x_in > 0 THEN
      x(0) <= x_in;         -- Input in register 0
      y(0) <= y_in;
      z(0) <= 0;
    ELSIF y_in > 0 THEN
      x(0) <= y_in;
      y(0) <= - x_in;
      z(0) <= 90;
    ELSE
      x(0) <= - y_in;
      y(0) <= x_in;
      z(0) <= -90;
    END IF;
  END PROCESS;

END flex;

Fig. 2.40. Fast CORDIC pipeline.

Fig. 2.41. CORDIC simulation results.

Figure 2.41 shows the simulation of the conversion of $X_0 = 215 = -41 \bmod 256$, and $Y_0 = 55$. Note that the radius is enlarged to $R = X_K = 111 = 1.618\sqrt{X_0^2 + Y_0^2}$ and the accumulated angle in degrees is $\arctan(Y_0/X_0) = 123°$. The design requires 244 LCs and runs with a fast synthesis selection at 39.68 MHz using I/O cell registers.
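The reported simulation values can be cross-checked with an integer model of the four stages. The Python sketch below is a hedged illustration: it ignores the pipeline registers (and therefore the latency), the names `tdiv` and `cordic_vectoring_4stage` are hypothetical, and the division is assumed to truncate toward zero as VHDL integer division does; a synthesized shifter might round differently for negative values.

```python
def tdiv(a, b):
    """Integer division truncating toward zero (assumed VHDL '/' behavior)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def cordic_vectoring_4stage(x_in, y_in):
    """Combinational model of the four stages in the listing above."""
    # Stage 0: pre-rotation by 0, +90, or -90 degrees.
    if x_in > 0:
        x, y, z = x_in, y_in, 0
    elif y_in > 0:
        x, y, z = y_in, -x_in, 90
    else:
        x, y, z = -y_in, x_in, -90
    # Stages 1 to 3: rotations by 45, 26, and 14 degrees.
    for divisor, angle in ((1, 45), (2, 26), (4, 14)):
        if y > 0:
            x, y, z = x + tdiv(y, divisor), y - tdiv(x, divisor), z + angle
        else:
            x, y, z = x - tdiv(y, divisor), y + tdiv(x, divisor), z - angle
    return x, y, z  # r, eps, phi


r, eps, phi = cordic_vectoring_4stage(-41, 55)
print(r, eps, phi)
```

For the inputs of Fig. 2.41 the model returns the radius and angle quoted in the text; the residual `eps` is the remaining Y component after only four stages.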



The actual LC count in the previous example is larger than that expected for a four-stage 8-bit pipeline design that is 5 × 8 × 3 = 120 LCs. The increase by a factor of two comes from the fact that a FLEX device uses an N-bit switchable LPM_ADD_SUB megafunction that needs 2N LCs. It needs 2N LCs because the LC has only three inputs in the fast arithmetic mode, and the switch mode needs four-input LUTs. A Xilinx XC4K series device would be needed, with four inputs per LC, to reduce the count by a factor of two.



Exercises 

2.1: Wallace has introduced an alternative scheme for a fast multiplier. The basic building block of this type of multiplier is a carry-save adder (CSA). A CSA takes three n-bit operands and produces two n-bit outputs. Because there is no propagation of the carry, this type of adder is sometimes called a 3:2 compressor or counter. For an n × n-bit multiplier we need a total of n − 2 CSAs to reduce the output to two operands. These operands then have to be added by a (fast) 2n-bit ripple-carry adder to compute the final result of the multiplier.

(a) The CSA computation can be done in parallel. Determine the minimum number of levels for an n × n-bit multiplier with n ∈ [0, 16].

(b) Explain why, for FPGAs with fast two’s complement adders, these multipliers are not more attractive than the usual array multiplier.

(c) Explain how a pipelined adder in the final adder stage can be used to implement a faster multiplier. Use the data from Table 2.7 (p. 54) to estimate the necessary LC usage and possible speed for
(c1) 8 × 8-bit multiplier.

(c2) 12 × 12-bit multiplier.



2.2: The Booth multiplier used the classical CSD code to reduce the number of necessary add/subtract operations. Starting with the LSB, typically two or three bits (called radix-4 and radix-8 algorithms) are processed in one step. The following table demonstrates possible radix-4 patterns and actions:

x_k+1  x_k  x_k-1    Accumulator activity      Comment
  0     0     0      ACC -> ACC + R*(0)        within a string of “0s”
  0     0     1      ACC -> ACC + R*(X)        end of a string of “1s”
  0     1     0      ACC -> ACC + R*(X)
  0     1     1      ACC -> ACC + R*(2X)       end of a string of “1s”
  1     0     0      ACC -> ACC + R*(-2X)      beginning of a string of “1s”
  1     0     1      ACC -> ACC + R*(-X)
  1     1     0      ACC -> ACC + R*(-X)       beginning of a string of “1s”
  1     1     1      ACC -> ACC + R*(0)        within a string of “1s”

The hardware requirements for a state machine implementation are an accumulator and a two’s complement shifter.

(a) Let X be a signed 6-bit two’s complement representation of −10 = 110110_2C. Complete the following table for the Booth product P = XY = −10Y and indicate the accumulator activity in each step.

Step    x5 x4 x3 x2 x1 x0 x-1    ACC    ACC + Booth rule

Start   1  1  0  1  1  0  0

0 ( 2 * 54 ) 

1 
2 

(b) Compare the latency of the Booth multiplier, with the serial/parallel multiplier 
from Example 2.17 (p. 58), for the radix-4 and radix-8 algorithms. 



Exercises Using MaxPlusII 

2.3: (a) Compile the HDL file add_2p with the MaxPlusII compiler with optimiza- 
tion for speed and area. How many LCs are needed? Explain the results. 

(b) Conduct a simulation with 15 + 102.



2.4: Explain how to modify the HDL design add_1p for subtraction.

(a) Modify the design and simulate as an example

(b) 3 − 2 and

(c) 2 − 3.

(d) Add an asynchronous set to the carry flip-flop to avoid initial wrong sum values. Simulate again 3 − 2.

2.5: (a) Compile the HDL file mul_ser with the MaxPlusII compiler.

(b) Determine the Registered Performance and size of the 8-bit design. What is 
the total multiplication latency? 



2.6: Modify the HDL design file mul_ser to multiply 12 x 12-bit numbers. 

(a) Simulate the new design with the values 1000 x 2000. 

(b) Measure the Registered Performance and the size. 

(c) What is the total multiplication latency of the 12 x 12-bit multiplier? 

2.7: (a) Design a state machine in MaxPlusII to implement the Booth multiplier 
(see Exercise 2.2) for 6x6 bit signed inputs. 

(b) Simulate the four data ±5 × (±9).

(c) Determine the Registered Performance. 

(d) Determine LC utilization for maximum speed. 

2.8: (a) Design a generic CSA that is used to build a Wallace-tree multiplier for 
an 8 x 8-bit multiplier. 

(b) Implement the 8x8 Wallace tree using MaxPlusII. 

(c) Use a final adder to compute the product, and test your multiplier with a 
multiplication of 100 x 63. 

(d) Pipeline the Wallace tree. What is the maximum throughput of the pipelined 
design? 

(e) Substitute the 16-bit output adder with the pipeline adder from the CD-ROM.
What is the Registered Performance of this design? 

2.9: (a) Use the principle of component instantiation, using the predefined macros LPM_ADD_SUB and LPM_MULT, to write the VHDL code for a pipelined complex 8-bit multiplier (i.e., (a + jb)(c + jd) = ac − bd + j(ad + bc)), with all operands a, b, c, and d in 8-bit.

(b) Determine the Registered Performance. 

(c) Determine LC utilization for maximum speed synthesis. 

(d) How many pipeline stages does the optimal single LPM_MULT multiplier have? 

(e) How many pipeline stages does the optimal complex multiplier have in total? 



2.10: An alternative algorithm for a complex multiplier is:

$$\begin{aligned}
s[1] &= a - b & s[2] &= c - d & s[3] &= c + d \\
m[1] &= s[1]d & m[2] &= s[2]a & m[3] &= s[3]b \\
s[4] &= m[1] + m[2] & s[5] &= m[1] + m[3] & & \\
(a + jb)(c + jd) &= s[4] + js[5] & & & &
\end{aligned} \tag{2.55}$$

which, in general, needs five adders and three multipliers. Verify that if one coefficient, say c + jd, is known, then s[2], s[3], and d can be prestored and the algorithm reduces to three adds and three multiplications. Also

(a) Design a pipelined 5/3 complex multiplier using the above algorithm for 8-bit 
signed inputs. Use the predefined macros LPM_ADD_SUB and LPM_MULT. 

(b) Measure Registered Performance and size for maximum speed synthesis. 

(c) How many pipeline stages does the single LPM_MULT multiplier have? 

(d) How many pipeline stages does the complex multiplier have in total? 
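Before committing the three-multiplier scheme of (2.55) to hardware, the identity can be checked numerically. The Python sketch below is only a sanity check with a hypothetical function name; it does not address bit widths, pipelining, or the LPM macros asked for in the exercise.

```python
def complex_mult_3(a, b, c, d):
    """(a + jb)(c + jd) with three multiplications, following (2.55)."""
    s1, s2, s3 = a - b, c - d, c + d
    m1, m2, m3 = s1 * d, s2 * a, s3 * b
    return m1 + m2, m1 + m3  # real part s[4], imaginary part s[5]


re, im = complex_mult_3(3, -4, 5, 6)
print(re, im)
```

Comparing the result with the direct form ac − bd and ad + bc for a few operand values confirms that the cross terms cancel as intended.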

2.11: Compile the HDL file cordic with the MaxPlusII compiler, and
(a) Conduct a simulation (using the waveform file cordic.scf) with x_in=±30 and y_in=±55. Determine the radius factor for all four simulations.

(b) Determine the maximum errors for radius and phase, compared with an unquantized computation.



2.12: Modify the HDL design cordic to implement stages 4 and 5 of the CORDIC 
pipeline. 

(a) Compute the rotation angle, and compile the VHDL code. 

(b) Conduct a simulation with values x_in=±30 and y_in=±55. 

(c) What are the maximum errors for radius and phase, compared with the un- 
quantized computation? 

2.13: Consider a floating-point representation with a sign bit, E = 7-bit exponent
width, and M = 10 bits for the mantissa (not counting the hidden one). 

(a) Compute the bias using (2.24) p. 47. 

(b) Determine the (absolute) largest number that can be represented. 

(c) Determine the (absolutely measured) smallest number (not including denor- 
mals) that can be represented. 



2.14: Using the result from Exercise 2.13 

(a) Determine the representation of $f_1 = 9.25_{10}$ in this (1,7,10) floating-point format.

(b) Determine the representation of $f_2 = -10.5_{10}$ in this (1,7,10) floating-point format.

(c) Compute $f_1 + f_2$ using floating-point arithmetic.

(d) Compute $f_1 * f_2$ using floating-point arithmetic.

(e) Compute $f_1/f_2$ using floating-point arithmetic.



2.15: For the IEEE single precision format (see Table 2.5, p. 50) determine the 
32-bit representation of: 

(a) $f_1 = -0$.

(b) $f_2 = \infty$.

(c) $f_3 = 9.25_{10}$.

(d) $f_4 = -10.5_{10}$.

(e) $f_5 = 0.1_{10}$.

(f) $f_6 = \pi = 3.141593_{10}$.

(g) $f_7 = \sqrt{3}/2 = 0.8660254_{10}$.

2.16: Compile the HDL file div_res from Example 2.18 (p. 67) to divide two num- 
bers. 

(a) Simulate the design with the values 234/3. 

(b) Simulate the design with the values 234/1. 

(c) Simulate the design with the values 234/0. Explain the result. 

2.17: Design a nonperforming divider based on the HDL file div_res from Example 2.18 (p. 67).

(a) Simulate the design with the values 234/50 as shown in Fig. 2.19, p. 70. 

(b) Measure Registered Performance, size and latency for maximum speed syn- 
thesis. 

2.18: Design a nonrestoring divider based on the HDL file div_res from Example 2.18 (p. 67).

(a) Simulate the design with the values 234/50 as shown in Fig. 2.20, p. 71.

(b) Measure Registered Performance, size and latency for maximum speed synthesis.

Fig. 2.42. Test bench for the barrel shifter.



2.19: Shift operations are usually implemented with a barrelshifter, which can be 
inferred in VHDL via the SLL instruction. Unfortunately, the SLL is not supported 
in Altera’s MaxPlusII because it is part of the 1993 standard and is not in the 
1987 standard, but we can design a barrelshifter in many different ways. We wish 
to design 12-bit barrelshifters, that have the following entity: 



ENTITY lshift IS                          ------> Interface
  GENERIC (W1 : INTEGER := 12;            -- data bit width
           W2 : INTEGER := 4);            -- ceil(log2(W1));
  PORT (clk      : IN  STD_LOGIC;
        distance : IN  STD_LOGIC_VECTOR (W2-1 DOWNTO 0);
        data     : IN  STD_LOGIC_VECTOR (W1-1 DOWNTO 0);
        result   : OUT STD_LOGIC_VECTOR (W1-1 DOWNTO 0));
END;

that should be verified via the simulation shown in Fig. 2.42. Use input and output registers for data and result, no register for the distance.

(a1) Use a PROCESS and (sequentially) convert each bit of the distance vector into an equivalent power-of-two constant multiplication. Use lshift as entity name.

(a2) Measure the Registered Performance and the size.

(b1) Use a PROCESS and shift (in a loop) the input data always 1 bit only, until loop counter and distance show the same value. Then transfer the shifted data to the output register. Use lshiftloop as entity name.

(b2) Measure the Registered Performance and the size.

(c1) Use a PROCESS environment and “demux” with a loop statement the distance vector into an equivalent multiplication factor. Then use a single (array) multiplier to perform the multiplication. Use lshiftdemux as the entity name.

(c2) Measure the Registered Performance and the size.

(d1) Use a PROCESS environment and convert with a case statement the distance vector into an equivalent multiplication factor. Then use a single (array) multiplier to perform the multiplication. Use lshiftmul as entity name.

(d2) Measure the Registered Performance and the size.

(e1) Use the lpm_clshift megafunction to implement the 12-bit barrelshifter. Use lshiftlpm as entity name.

(e2) Measure the Registered Performance and the size.

(f) Compare all 5 barrelshifter designs in terms of Registered Performance, size, and design reuse, i.e., effort to change data width and the use of other software than MaxPlusII.





3. Finite Impulse Response (FIR) Digital Filters



3.1 Digital Filters 

Digital filters are typically used to modify or alter the attributes of a signal in the time or frequency domain. The most common digital filter is the linear time-invariant (LTI) filter. An LTI filter interacts with its input signal through a process called linear convolution, denoted by $y = f * x$ where $f$ is the filter’s impulse response, $x$ is the input signal, and $y$ is the convolved output. The linear convolution process is formally defined by:

$$y[n] = x[n] * f[n] = \sum_k x[k]f[n-k] = \sum_k f[k]x[n-k]. \tag{3.1}$$

LTI digital filters are generally classified as being finite impulse response (i.e., FIR), or infinite impulse response (i.e., IIR). As the name implies, an FIR filter consists of a finite number of sample values, reducing the above convolution sum to a finite sum per output sample instant. An IIR filter, however, requires that an infinite sum be performed. An FIR design and implementation methodology is discussed in this chapter, while IIR filter issues are addressed in Chap. 4.

The motivation for studying digital filters is found in their growing popularity as a primary DSP operation. Digital filters are rapidly replacing classic analog filters, which were implemented using RLC components and operational amplifiers. Analog filters were mathematically modeled using ordinary differential equations or Laplace transforms. They were analyzed in the time or s (also known as Laplace) domain. Analog prototypes are now only used in IIR design, while FIR filters are typically designed using direct computer specifications and algorithms.

In this chapter it is assumed that a digital filter, an FIR in particular, 
has been designed and selected for implementation. The FIR design process 
will be briefly reviewed, followed by a discussion of FPGA implementation 
variations.

Fig. 3.1. Direct form FIR filter.



3.2 FIR Theory 

An FIR with constant coefficients is an LTI digital filter. The output of an 
FIR of order or length L, to an input time-series x[n], is given by a finite 
version of the convolution sum given in (3.1), namely: 

$$y[n] = x[n] * f[n] = \sum_{k=0}^{L-1} f[k]\,x[n-k], \tag{3.2}$$

where f[0] ≠ 0 through f[L-1] ≠ 0 are the filter’s L coefficients. They also
correspond to the FIR’s impulse response. For LTI systems it is sometimes
more convenient to express (3.2) in the z-domain with

$$Y(z) = F(z)X(z), \tag{3.3}$$

where F(z) is the FIR’s transfer function defined in the z-domain by

$$F(z) = \sum_{k=0}^{L-1} f[k]\,z^{-k}. \tag{3.4}$$

The Lth-order LTI FIR filter is graphically interpreted in Fig. 3.1. It can
be seen to consist of a “tapped delay line,” adders, and multipliers.
One of the operands presented to each multiplier is an FIR coefficient,
often referred to as a “tap weight” for obvious reasons. Historically, the FIR
filter is also known by the name “transversal filter,” suggesting its “tapped
delay line” structure.

The roots of polynomial F(z) in (3.4) define the zeros of the filter. The
presence of only zeros is the reason that FIRs are sometimes called all-zero
filters. In Chap. 5 we will discuss an important class of FIR filters (called
CIC filters) that are recursive but also FIR. This is possible because the poles
produced by the recursive part are canceled by the nonrecursive part of the
filter. The effective pole/zero plot then also has only zeros, i.e., the filter
is an all-zero filter or FIR. We note that nonrecursive filters are always FIR,
but recursive filters can be either FIR or IIR. Figure 3.2 illustrates this
dependence.

Fig. 3.2. Relation between structure and impulse length.



3.2.1 FIR Filter with Transposed Structure 

A variation of the direct FIR model is called the transposed FIR filter. It can
be constructed from the FIR filter in Fig. 3.1 by:

• Exchanging the input and output 

• Inverting the direction of signal flow 

• Substituting an adder by a fork, and vice versa 

A transposed FIR filter is shown in Fig. 3.3 and is, in general, the preferred 
implementation of an FIR filter. The benefit of this filter is that we do not 
need an extra shift register for x[n], and there is no need for an extra pipeline 
stage for the adder (tree) of the products to achieve high throughput. 

The following examples show a direct implementation of the transposed 
filter. 

Example 3.1: Programmable FIR Filter 

We recall from the discussion of sum-of-product (SOP) computations using a
PDSP (see Sect. 2.7, p. 87) that, for B_x data/coefficient bit width and filter
length L, additional log2(L) bits for unsigned SOP and log2(L) - 1 guard bits
for signed arithmetic must be provided. For 9-bit signed data/coefficients
and L = 4, the adder width must be 9 + 9 + log2(4) - 1 = 19.
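The guard-bit rule above can be checked with a short Python sketch. The function name is hypothetical, and rounding log2(L) up for lengths that are not a power of two is an assumption that the text does not state.

```python
import math

def sop_adder_width(data_bits, coeff_bits, length, signed=True):
    """Adder width for a sum of `length` products of the given operand widths."""
    guard = math.ceil(math.log2(length))   # assumption: round up for non powers of two
    if signed:
        guard -= 1                         # signed SOP needs one guard bit less
    return data_bits + coeff_bits + guard

w3 = sop_adder_width(9, 9, 4)              # parameters of Example 3.1
```

With the parameters of Example 3.1 this reproduces the adder width W3 = 19 used in the generic declaration.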




Fig. 3.3. FIR filter in the transposed structure.

The following VHDL code² shows the generic specification for an implementation
for a length-4 filter.

    -- This is a generic FIR filter generator
    -- It uses W1 bit data/coefficients bits
    LIBRARY lpm;                     -- Using predefined packages
    USE lpm.lpm_components.ALL;

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;
    USE ieee.std_logic_unsigned.ALL;

    ENTITY fir_gen IS                        ------> Interface
      GENERIC (W1 : INTEGER := 9;   -- Input bit width
               W2 : INTEGER := 18;  -- Multiplier bit width 2*W1
               W3 : INTEGER := 19;  -- Adder width = W2+log2(L)-1
               W4 : INTEGER := 11;  -- Output bit width
               L  : INTEGER := 4;   -- Filter length
               Mpipe : INTEGER := 3 -- Pipeline steps of multiplier
               );
      PORT ( clk    : IN  STD_LOGIC;
             Load_x : IN  STD_LOGIC;
             x_in   : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
             c_in   : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
             y_out  : OUT STD_LOGIC_VECTOR(W4-1 DOWNTO 0));
    END fir_gen;

    ARCHITECTURE flex OF fir_gen IS

      SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
      SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
      SUBTYPE N3BIT IS STD_LOGIC_VECTOR(W3-1 DOWNTO 0);
      TYPE ARRAY_N1BIT IS ARRAY (0 TO L-1) OF N1BIT;
      TYPE ARRAY_N2BIT IS ARRAY (0 TO L-1) OF N2BIT;
      TYPE ARRAY_N3BIT IS ARRAY (0 TO L-1) OF N3BIT;

      SIGNAL x : N1BIT;
      SIGNAL y : N3BIT;
      SIGNAL c : ARRAY_N1BIT;  -- Coefficient array
      SIGNAL p : ARRAY_N2BIT;  -- Product array
      SIGNAL a : ARRAY_N3BIT;  -- Adder array

    BEGIN

      Load: PROCESS            ------> Load data or coefficient
      BEGIN
        WAIT UNTIL clk = '1';
        IF (Load_x = '0') THEN
          c(L-1) <= c_in;      -- Store coefficient in register
          FOR I IN L-2 DOWNTO 0 LOOP  -- Coefficients shift one
            c(I) <= c(I+1);
          END LOOP;
        ELSE
          x <= x_in;           -- Get one data sample at a time
        END IF;
      END PROCESS Load;

      SOP: PROCESS (clk)       ------> Compute sum-of-products
      BEGIN
        IF clk'event and (clk = '1') THEN
          FOR I IN 0 TO L-2 LOOP        -- Compute the transposed
            a(I) <= (p(I)(W2-1) & p(I)) + a(I+1); -- filter adds
          END LOOP;
          a(L-1) <= p(L-1)(W2-1) & p(L-1);  -- First TAP has
        END IF;                             -- only a register
        y <= a(0);
      END PROCESS SOP;

      -- Instantiate L pipelined multiplier
      MulGen: FOR I IN 0 TO L-1 GENERATE
        Muls: lpm_mult           -- Multiply p(i) = c(i) * x;
          GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,
                        LPM_PIPELINE => Mpipe,
                        LPM_REPRESENTATION => "SIGNED",
                        LPM_WIDTHP => W2,
                        LPM_WIDTHS => W2)
          PORT MAP ( clock => clk, dataa => x,
                     datab => c(I), result => p(I));
      END GENERATE;

      y_out <= y(W3-1 DOWNTO W3-W4);

    END flex;

² The equivalent Verilog code fir_gen.v for this example can be found in
Appendix A on page 451.



The first process, Load, is used to load the coefficient in a tapped delay line if
Load_x=0. Otherwise, a data word is loaded in the x register. The second
process, called SOP, implements the sum-of-product computation. The products
p(I) are sign-extended by one bit and added to the previous partial SOP.
Note also that all multipliers are instantiated by a generate statement, which
allows the assignment of extra pipeline stages. Finally, the output y_out is
assigned the value of the SOP, divided by 256, because the coefficients are
assumed to be all fractional (i.e., |f[k]| < 1.0). The design uses 892 LCs and
runs with 41.66 MHz Registered Performance.

To simulate this length-4 filter, consider the Daubechies DB4 filter coefficients
given by

$$G(z) = \frac{(1+\sqrt{3}) + (3+\sqrt{3})z^{-1} + (3-\sqrt{3})z^{-2} + (1-\sqrt{3})z^{-3}}{4\sqrt{2}}$$

$$G(z) = 0.48301 + 0.8365z^{-1} + 0.2241z^{-2} - 0.1294z^{-3}.$$

Quantizing the coefficients to 8 bits (plus sign bit) of precision results in the
following model:

$$G(z) = \left(124 + 214z^{-1} + 57z^{-2} - 33z^{-3}\right)/256 = \frac{124}{256} + \frac{214}{256}z^{-1} + \frac{57}{256}z^{-2} - \frac{33}{256}z^{-3}.$$

Fig. 3.4. Simulation of the 4-tap programmable FIR filter with Daubechies filter
coefficient loaded.
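The quantization step and the transposed structure can be cross-checked with a small behavioral model in Python. This is only a sketch: the function names are hypothetical, the multiplier pipeline latency of the VHDL design is ignored, and the adder registers are assumed to start at zero.

```python
import math

def db4_coefficients():
    """Floating-point DB4 lowpass coefficients."""
    s3, scale = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    return [(1 + s3) / scale, (3 + s3) / scale, (3 - s3) / scale, (1 - s3) / scale]

def quantize(coeffs, frac_bits=8):
    """Round each coefficient to an integer multiple of 2**-frac_bits."""
    return [round(c * 2 ** frac_bits) for c in coeffs]

def transposed_fir(samples, coeffs):
    """Transposed form: a(I) <= p(I) + a(I+1), output taken from a(0)."""
    L = len(coeffs)
    a = [0] * L                      # adder registers, assumed cleared
    y = []
    for x in samples:
        p = [c * x for c in coeffs]  # every multiplier sees the same input sample
        a = [p[i] + a[i + 1] for i in range(L - 1)] + [p[L - 1]]
        y.append(a[0])
    return y

c = quantize(db4_coefficients())
impulse_response = transposed_fir([100, 0, 0, 0], c)
scaled = [v >> 8 for v in impulse_response]   # keep the upper bits, i.e. divide by 256
```

Feeding a single sample of 100 reproduces the scaled coefficient sequence, which is the impulse-response check described for Fig. 3.4.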

As can be seen from Fig. 3.4, in the first four steps we load the coefficients
{124, 214, 57, -33} into the tapped delay line. Note that MaxPlusII displays
-33 as an unsigned number, i.e., 512 - 33 = 479. Then we check the impulse
response of the filter by loading 100 into the x register. The first valid output
is then available after 450 ns, as can be seen from Fig. 3.4. | 3.1 |



3.2.2 Symmetry in FIR Filters 

The center of an FIR’s impulse response is an important point of symmetry.
It is sometimes convenient to define this point as the 0th sample instant. Such
filter descriptions are a-causal (centered notation). For an odd-length FIR,
the a-causal filter model is given by:

$$F(z) = \sum_{k=-(L-1)/2}^{(L-1)/2} f[k]\,z^{-k}. \tag{3.5}$$

The FIR’s frequency response can be computed by evaluating the filter’s
transfer function on the unit circle, by setting z = e^{jωT}.

It then follows that:

$$F(\omega) = F(e^{j\omega T}) = \sum_{k} f[k]\,e^{-j\omega kT}. \tag{3.6}$$

We then denote with |F(ω)| the filter’s magnitude frequency response, and
φ(ω) denotes the phase response, which satisfies:

$$\phi(\omega) = \arctan\left(\frac{\Im(F(\omega))}{\Re(F(\omega))}\right). \tag{3.7}$$
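Equations (3.6) and (3.7) can be evaluated numerically with a few lines of Python. This is a sketch under stated assumptions: the filter is given in causal form f[0], ..., f[L-1], the frequency is passed as the product ωT in radians, and the four-quadrant `atan2` is used in place of the plain arctangent.

```python
import cmath
import math

def freq_response(f, omega_T):
    """F(w) = sum_k f[k] * exp(-j*w*k*T) for a causal FIR."""
    return sum(fk * cmath.exp(-1j * omega_T * k) for k, fk in enumerate(f))

def magnitude_and_phase(f, omega_T):
    """Return (|F(w)|, phi(w)) with the phase in radians."""
    F = freq_response(f, omega_T)
    return abs(F), math.atan2(F.imag, F.real)

mag_dc, phase_dc = magnitude_and_phase([1, 1], 0.0)            # two-tap averager at DC
mag_q, phase_q = magnitude_and_phase([1, 1], math.pi / 2)      # quarter of the sampling rate
```

For the two-tap example the phase falls linearly with frequency, which anticipates the linear-phase discussion that follows.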

Digital filters are more often characterized by phase and magnitude than
by the z-domain transfer function or the complex frequency transform.

Table 3.1. Four possible linear-phase FIR filters F(z) = Σ_k f[k] z^{-k}.

3.2.3 Linear-phase FIR Filters



Maintaining phase integrity across a range of frequencies is a desired system
attribute in many applications such as communications and image processing.
As a result, designing filters whose phase is linear versus frequency is
often mandatory. The standard measure of the phase linearity of a system is
the “group delay” defined by:

$$\tau(\omega) = -\frac{d\phi(\omega)}{d\omega}. \tag{3.8}$$



A perfectly linear-phase filter has a group delay that is constant over a
range of frequencies. It can be shown that linear phase is achieved if the
filter is symmetric or antisymmetric, and it is therefore preferable to use the
a-causal framework of (3.5). From (3.7) it can be seen that a constant group
delay can only be achieved if the frequency response F(ω) is a purely real or
imaginary function. This implies that the filter’s impulse response possesses
even or odd symmetry. That is:

$$f[n] = f[-n] \quad \text{or} \quad f[n] = -f[-n]. \tag{3.9}$$



An odd-order even-symmetry FIR filter would, for example, have a frequency
response given by:

$$F(\omega) = f[0] + \sum_{k>0} \left( f[k]\,e^{-jk\omega T} + f[-k]\,e^{jk\omega T} \right) \tag{3.10}$$

$$\phantom{F(\omega)} = f[0] + 2\sum_{k>0} f[k]\cos(k\omega T), \tag{3.11}$$

which is seen to be a purely real function of frequency. Table 3.1 summarizes
the four possible choices of symmetry, antisymmetry, even order and odd
order. In addition, Table 3.1 graphically displays an example of each class of
linear-phase FIR.

Fig. 3.5. Linear-phase filter with reduced number of multipliers.



The symmetry properties intrinsic to a linear-phase FIR can also be used
to reduce the number of multipliers below the L required by the direct form
of Fig. 3.1. Consider the linear-phase FIR shown in Fig. 3.5 (even symmetry
assumed), which fully exploits coefficient symmetry. Observe that the
“symmetric” architecture has a multiplier budget per filter cycle exactly half
of that found in the direct architecture shown in Fig. 3.1 (L versus L/2),
while the number of adders remains constant at L - 1.
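The saving can be illustrated with a behavioral Python sketch that first adds the two delay-line samples sharing a coefficient and then multiplies once. The function names and test data are illustrative assumptions, and an odd-length filter keeps one extra multiplier for its center tap.

```python
def fir_direct(x, f):
    """Reference direct form: one multiplication per tap and output sample."""
    L = len(f)
    return [sum(f[k] * x[n - k] for k in range(L) if n - k >= 0)
            for n in range(len(x))]

def fir_symmetric(x, f):
    """Even-symmetric FIR: pre-add mirrored taps, then multiply once per pair."""
    L = len(f)
    assert all(f[k] == f[L - 1 - k] for k in range(L)), "coefficients must be symmetric"

    def tap(n, k):                       # delay-line sample x[n-k], zero before start
        return x[n - k] if n - k >= 0 else 0

    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L // 2):          # L/2 multiplications for even L
            acc += f[k] * (tap(n, k) + tap(n, L - 1 - k))
        if L % 2:                        # center tap of an odd-length filter
            acc += f[L // 2] * tap(n, L // 2)
        y.append(acc)
    return y

x = [1, 2, 3, 4, 5, 6]                   # illustrative input
f = [-4, 15, 15, -4]                     # illustrative symmetric coefficients
```

Both functions return the same output sequence; only the number of multiplications per output sample differs.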



3.3 Designing FIR Filters 

Modern digital FIR filters are designed using computer-aided engineering
(CAE) tools. The filters used in this chapter are designed using the MatLab
Signal Processing toolbox. The toolbox includes an “Interactive Lowpass Filter
Design” demo example that covers many typical digital filter designs,
including:

• Equiripple (also known as minimax) FIR design, which uses the Parks-McClellan
and Remez exchange methods for designing a linear-phase (symmetric)
equiripple FIR. This equiripple design may also be used to design
a differentiator or Hilbert transformer.

• Kaiser window design using the inverse DFT method weighted by a Kaiser
window.

• Least square FIR method. This filter design also has ripple in the passband
and stopband, but the mean least square error is minimized.

• Four IIR filter design methods (Butterworth, Chebyshev I and II, and
elliptic) which will be discussed in Chap. 4.

The FIR methods are individually developed in this section. Most often we
already know the transfer function (i.e., magnitude of the frequency response)
of the desired filter. Such a lowpass specification typically consists of the
passband [0 ... ω_p], the transition band [ω_p ... ω_s], and the stopband [ω_s ... π]
specification, where the sampling frequency is assumed to be 2π. To compute
the filter coefficients we may therefore apply the direct frequency method
discussed next.

3.3.1 Direct Window Design Method 

The discrete Fourier transform (DFT) establishes a direct connection between 
the frequency and time domains. Since the frequency domain is the domain 
of filter definition, the DFT can be used to calculate a set of FIR filter 
coefficients that produce a filter that approximates the frequency response of 
the target filter. A filter designed in this manner is called a direct FIR filter. 
A direct FIR filter is defined by: 

$$f[n] = \mathrm{IDFT}(F[k]) = \sum_{k} F[k]\,e^{j2\pi kn/L}. \tag{3.12}$$

From basic signals and systems theory, it is known that the spectrum of
a real signal is Hermitian. That is, the real spectrum has even symmetry and
the imaginary spectrum has odd symmetry. If the synthesized filter should
have only real coefficients, the target DFT design spectrum must therefore
be Hermitian or F[k] = F*[-k], where the * denotes the complex conjugate.
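The role of the Hermitian condition can be seen in a short Python sketch of the direct method. The helper names and the chosen passband are illustrative, and the 1/L scaling of the inverse DFT is an assumption (it is the usual convention, but it does not appear explicitly in (3.12)).

```python
import cmath

def lowpass_spectrum(L, passband_bins):
    """Zero-phase ideal lowpass samples, built so that F[k] = F[L - k]."""
    return [1.0 if (k <= passband_bins or k >= L - passband_bins) else 0.0
            for k in range(L)]

def idft(F):
    """f[n] = (1/L) * sum_k F[k] * exp(j*2*pi*k*n/L); the 1/L factor is assumed."""
    L = len(F)
    return [sum(F[k] * cmath.exp(2j * cmath.pi * k * n / L) for k in range(L)) / L
            for n in range(L)]

F = lowpass_spectrum(16, 2)                  # length-16 design, bins 0..2 in the passband
f = idft(F)
max_imag = max(abs(v.imag) for v in f)       # only rounding noise for a Hermitian target
```

Because the target spectrum is real and even, the resulting coefficients are real and circularly symmetric about n = 0; a rotation by L/2 samples would give the centered impulse response.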

Consider a length- 16 direct FIR filter design with a rectangular window, 
shown in Fig. 3.6a, with the passband ripple shown in Fig. 3.6b. Note that 
the filter provides a reasonable approximation to the ideal lowpass filter with 
the greatest mismatch occurring at the edges of the transition band. The 
observed “ringing” is due to the Gibbs phenomenon, which relates to the 
inability of a finite Fourier spectrum to reproduce sharp edges. The Gibbs 
ringing is implicit in the direct inverse DFT method and can be expected to 
be about ±7% over a wide range of filter orders. To illustrate this, consider 
the example filter with length 128, shown in Fig. 3.6c, with the passband 
ripple shown in Fig. 3.6d. Although the filter length is substantially increased
(from 16 to 128), the ringing at the edge still has about the same magnitude.
The effects of ringing can only be suppressed with the use of a data “window”
that tapers smoothly to zero on both sides. Data windows overlay the FIR’s
impulse response, resulting in a “smoother” magnitude frequency response
with an attendant widening of the transition band. If, for instance, a Kaiser
window is applied to the FIR, the Gibbs ringing can be reduced as shown in
Fig. 3.7 (upper). The deleterious effect on the transition band can also be seen.
Other classic window functions are summarized below. They differ in terms
of their ability to make tradeoffs between “ringing” and transition bandwidth
extension. The number of recognized and published window functions is large.
The most common windows, denoted w[n], are:

• Rectangular: w[n] = 1

• Bartlett (triangular): w[n] = 2n/N

• Hanning: w[n] = 0.5 (1 - cos(2πn/L))

• Hamming: w[n] = 0.54 - 0.46 cos(2πn/L)

• Blackman: w[n] = 0.42 - 0.5 cos(2πn/L) + 0.08 cos(4πn/L)

• Kaiser: w[n] = I_0(β sqrt(1 - (n - L/2)^2/(L/2)^2)) / I_0(β)

Fig. 3.6. Gibbs phenomenon. (a) Impulse response of FIR lowpass with L = 16.
(b) Passband of transfer function L = 16. (c) Impulse response of FIR lowpass
with L = 128. (d) Passband of transfer function L = 128.
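The window formulas can be tabulated with the following Python sketch. Two points are assumptions rather than statements from the text: the index n is taken to run from 0 to L inclusive so that each window is symmetric, and I_0 is evaluated with a truncated power series.

```python
import math

def hanning(L):
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / L)) for n in range(L + 1)]

def hamming(L):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / L) for n in range(L + 1)]

def blackman(L):
    return [0.42 - 0.5 * math.cos(2.0 * math.pi * n / L)
            + 0.08 * math.cos(4.0 * math.pi * n / L) for n in range(L + 1)]

def bessel_i0(x, terms=50):
    """Power series I0(x) = sum_k ((x/2)**k / k!)**2, truncated after `terms` terms."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total

def kaiser(L, beta):
    half = L / 2.0
    w = []
    for n in range(L + 1):
        r = (n - half) / half
        arg = beta * math.sqrt(max(0.0, 1.0 - r * r))   # guard against rounding below zero
        w.append(bessel_i0(arg) / bessel_i0(beta))
    return w
```

Setting beta to zero turns the Kaiser window into the rectangular window, and larger values of beta taper the edges more strongly.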

Table 3.2 shows the most important parameters of these windows. 

Table 3.2. Parameters of commonly used window functions.

Name          3-dB        First   Maximum    Sidelobe decrease   Equivalent
              bandwidth   zero    sidelobe   per octave          Kaiser β
Rectangular   0.89/T      1/T     -13 dB     -6 dB               0
Bartlett      1.28/T      2/T     -27 dB     -12 dB              1.33
Hanning       1.44/T      2/T     -32 dB     -18 dB              3.86
Hamming       1.33/T      2/T     -42 dB     -6 dB               4.86
Blackman      1.79/T      3/T     -74 dB     -6 dB               7.04
Kaiser        1.44/T      2/T     -38 dB     -18 dB              3

The 3-dB bandwidth shown in Table 3.2 is the bandwidth where the
transfer function is decreased from DC by 3 dB or ≈ 1/√2. Data windows
also generate sidelobes, to various degrees, away from the 0th harmonic.
Depending on the smoothness of the window, the third column in Table 3.2
shows that some windows do not have a zero at the first or second zero DFT
frequency 1/T. The maximum sidelobe gain is measured relative to the 0th
harmonic value. The fifth column describes the asymptotic decrease of the
window per octave. Finally, the last column describes the value β for a Kaiser
window that emulates the corresponding window properties. The Kaiser window,
based on the zeroth-order modified Bessel function of the first kind I_0, is
special in two respects. First, it is nearly optimal in terms of the relationship
between “ringing” suppression and transition width; second, it can be tuned
by β, which determines the ringing of the filter. This can be seen from the
following equation credited to Kaiser.

$$\beta = \begin{cases} 0.1102(A - 8.7) & A > 50, \\ 0.5842(A - 21)^{0.4} + 0.07886(A - 21) & 21 \le A \le 50, \\ 0 & A < 21, \end{cases} \tag{3.13}$$

where A = 20 log10 ε_r is both stopband attenuation and the passband ripple
in dB. The Kaiser window length to achieve a desired level of suppression
can be estimated:

$$L = \frac{A - 8}{2.285(\omega_s - \omega_p)} + 1. \tag{3.14}$$



The length is generally correct within an error of ±2 taps. 
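Equations (3.13) and (3.14) translate directly into code. In the Python sketch below the function names are hypothetical, the band edges are given in radians per sample (sampling frequency 2π), and rounding the length up to the next integer is an assumption.

```python
import math

def kaiser_beta(A):
    """Kaiser parameter beta for an attenuation A in dB, following (3.13)."""
    if A > 50:
        return 0.1102 * (A - 8.7)
    if A >= 21:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.0

def kaiser_length(A, omega_p, omega_s):
    """Length estimate from (3.14); rounding up is an assumption of this sketch."""
    return math.ceil((A - 8) / (2.285 * (omega_s - omega_p)) + 1)

beta_60 = kaiser_beta(60.0)
length_50 = kaiser_length(50.0, 0.4 * math.pi, 0.5 * math.pi)   # illustrative band edges
```

Since the estimate is only good to within about two taps, the result is best treated as a starting point that is then verified against the actual frequency response.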



3.3.2 Equiripple Design Method 

A typical filter specification not only includes the specification of passband
ω_p and stopband ω_s frequencies and ideal gains, but also the allowed deviation
(or ripple) from the desired transfer function. The transition band is
most often assumed to be arbitrary in terms of ripples. A special class of FIR
filter that is particularly effective in meeting such specifications is called the
equiripple FIR. An equiripple design protocol minimizes the maximal deviations
(ripple error) from the ideal transfer function. The equiripple algorithm
applies to a number of FIR design instances. The most popular are:

• Lowpass filter design (in MatLab use remez(L,F,A,W)), with tolerance
scheme as shown in Fig. 3.8a

• Hilbert filter, i.e., a unit magnitude filter that produces a 90° phase shift
for all frequencies in the passband (in MatLab use remez(L,F,A,'Hilbert'))

• Differentiator filter that has a linearly increasing frequency magnitude
proportional to ω (in MatLab use remez(L,F,A,'differentiator'))

Fig. 3.7. (upper) Kaiser window design with L = 59. (lower) Parks-McClellan
design with L = 27. (a) Transfer function. (b) Group delay of passband.
(c) Zero plot.

The equiripple or minimum-maximum algorithm is normally implemented
using the Parks-McClellan iterative method. The Parks-McClellan method
is used to produce an equiripple or minimax data fit in the frequency domain.
It is based on the “alternation theorem” that says that there is exactly one
polynomial, a Chebyshev polynomial with minimum length, that fits into a
given tolerance scheme. Such a tolerance scheme is shown in Fig. 3.8a, and
Fig. 3.8b shows a polynomial that fulfills this tolerance scheme. The length
of the polynomial, and therefore the filter, can be estimated for a lowpass
with

(3.15)

where ε_p is the passband and ε_s the stopband ripple.

Fig. 3.8. Parameters for the filter design. (a) Tolerance scheme. (b) Example

The algorithm iteratively finds the location of locally maximum errors 
that deviate from a nominal value, reducing the size of the maximal error 
per iteration, until all deviation errors have the same value. Most often, the 
Remez method is used to select the new frequencies by selecting the frequency 
set with the largest peaks of the error curve between two iterations, see [67, 
p. 478]. This is why the MatLab equiripple function is called remez. 

Compared to the direct frequency method, with or without data windows,
the advantage of the equiripple design method is that passband and stopband 
deviations can be specified differently. This may, for instance, be useful in 
audio applications where the ripple in the passband may be specified to be 
higher, because the ear only perceives differences larger than 3 dB. 

We note from Fig. 3.7(lower) that the equiripple design having the same 
tolerance requirements as the Kaiser window design enjoys a considerably 
reduced filter order, i.e., 27 compared with 59. 



3.4 Constant Coefficient FIR Design 

There are only a few applications (e.g., adaptive filters) where we need a 
general programmable filter architecture like the one shown in Example 3.1 
(p. 111). In many applications, the filters are LTI (i.e., linear time-invariant)
and the coefficients do not change over time. In this case, the hardware effort
can be reduced substantially by exploiting the multiplier and adder (trees)
needed to implement the FIR filter arithmetic.

With available digital filter design software the production of FIR coef- 
ficients is a straightforward process. The challenge remains to map the FIR 
design into a suitable architecture. The direct or transposed forms are pre- 
ferred for maximum speed and lowest resource utilization. Lattice filters are 
used in adaptive filters because the filter can be enlarged by one section, 
without the need for recomputation of the previous lattice sections. But this 
feature only applies to PDSPs and is less applicable to FPGAs. We will 
therefore focus our attention on the direct and transposed implementations. 
We will start with possible improvements to the direct form and will then 
move on to the transposed form. At the end of the section we will discuss an 
alternative design approach using distributed arithmetic. 

3.4.1 Direct FIR Design 

The Direct FIR filter shown in Fig. 3.1 (p. 110) can be implemented in VHDL 
using (sequential) PROCESS statements or by “component instantiations” of 
the adders and multipliers. A PROCESS design provides more freedom to the 
synthesizer, while component instantiation gives full control to the designer. 
To illustrate this, a length-4 FIR will be presented as a PROCESS design. Al- 
though a length-4 FIR is far too short for most practical applications, it is 
easily extended to higher orders and has the advantage of a short compil- 
ing time. The linear-phase (therefore symmetric) FIR’s impulse response is 
assumed to be given by 

f[k] = {-1.0, 3.75, 3.75, -1.0}. (3.16) 

These coefficients can be directly encoded into a 4-bit fraction number. For
example, 3.75_10 would have a 4-bit binary representation 11.11_2, where “.”
denotes the location of the binary point. Note that it is, in general, more
efficient to implement only positive CSD coefficients, because positive CSD
coefficients have fewer nonzero terms and we can take the sign of the coefficient
into account when the summation of the products is computed. See
also the first two steps in the RAG Algorithm 3.4 discussed later, p. 127.

In a practical situation, the FIRs are obtained from a computer design tool 
and presented to the designer as floating-point numbers. The performance of 
a fixed-point FIR, based on floating-point coefficients, needs to be verified 
using simulation or algebraic analysis to ensure that design specifications 
remain satisfied. In the above example, the floating-point numbers are 3.75 
and 1.0, which can be represented exactly with fixed-point numbers, and the 
check can be skipped. 

Another issue that must be addressed when working with fixed-point designs
is protecting the system from dynamic range overflow. Fortunately, the
worst-case dynamic range growth G of an Lth-order FIR is easy to compute
and it is:

$$G \le \log_2\left(\sum_{k=0}^{L-1} |f[k]|\right). \tag{3.17}$$

The total bit width is then the sum of the input bit width and the bit
growth G. For the above filter (3.16) we have G = log2(9.5) < 4, which
states that the system’s internal data registers need to have at least four
more integer bits than the input data to ensure no overflow. If 8-bit internal
arithmetic is used the input data should be bounded by ±128/9.5 = ±13.
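The bound (3.17) and the resulting input limit can be reproduced with a short Python sketch; the function names are hypothetical and the growth is rounded up to whole bits.

```python
import math

def bit_growth(coeffs):
    """Worst-case growth in whole bits: ceil(log2(sum |f[k]|)), following (3.17)."""
    return math.ceil(math.log2(sum(abs(c) for c in coeffs)))

def max_input_magnitude(coeffs, word_bits):
    """Largest input magnitude that cannot overflow a signed word of word_bits."""
    return int(2 ** (word_bits - 1) / sum(abs(c) for c in coeffs))

f = [-1.0, 3.75, 3.75, -1.0]              # coefficients from (3.16)
growth = bit_growth(f)                    # log2(9.5) rounded up
limit = max_input_magnitude(f, 8)         # bound for 8-bit internal arithmetic
```

For the coefficients of (3.16) this gives four bits of growth and an input bound of 13, in line with the numbers quoted above.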

Example 3.2: Four-tap Direct FIR Filter 

The VHDL design³ for a filter with coefficients {-1, 3.75, 3.75, -1} is shown
in the following listing.

    PACKAGE eight_bit_int IS    -- User defined types
      SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
      TYPE ARRAY_BYTE IS ARRAY (0 TO 3) OF BYTE;
    END eight_bit_int;

    LIBRARY work;
    USE work.eight_bit_int.ALL;

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;

    ENTITY fir_srg IS                        ------> Interface
      PORT (clk : IN  STD_LOGIC;
            x   : IN  BYTE;
            y   : OUT BYTE);
    END fir_srg;

    ARCHITECTURE flex OF fir_srg IS

      SIGNAL tap : ARRAY_BYTE;  -- Tapped delay line of bytes
    BEGIN

      p1: PROCESS               ------> Behavioral Style
      BEGIN
        WAIT UNTIL clk = '1';
        -- Compute output y with the filter coefficients weight.
        -- The coefficients are [-1 3.75 3.75 -1].
        -- Division for Altera VHDL is only allowed for
        -- powers-of-two values!
        y <= 2 * tap(1) + tap(1) + tap(1) / 2 + tap(1) / 4
           + 2 * tap(2) + tap(2) + tap(2) / 2 + tap(2) / 4
           - tap(3) - tap(0);
        FOR I IN 3 DOWNTO 1 LOOP
          tap(I) <= tap(I-1);   -- Tapped delay line: shift one
        END LOOP;
        tap(0) <= x;            -- Input in register 0
      END PROCESS;

    END flex;

³ The equivalent Verilog code fir_srg.v for this example can be found in
Appendix A on page 453.

Fig. 3.9. VHDL simulation results of the FIR filter with impulse input 10.

The design is a literal interpretation of the Direct FIR architecture found in 
Fig. 3.1 (p. 110). The design is applicable to both symmetric and asymmetric 
filters. The output of each tap of the tapped delay line is multiplied with the 
appropriately weighted binary value and the results are added. The impulse 
response y of the filter to an impulse 10 is shown in Fig. 3.9. Note that
MaxPlusII displays -10 as an unsigned number, i.e., 256 - 10 = 246. | 3.2 |
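The register behavior of this listing can be mimicked with a cycle-by-cycle Python model. It is a sketch under assumptions: all registers start at zero, and the integer divisions truncate toward zero as the VHDL integer "/" operator does (a synthesized shift may treat negative values differently).

```python
def trunc_div(a, b):
    """Integer division that truncates toward zero, like VHDL integer '/'."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def fir_srg(samples):
    """One loop pass per clock edge; right-hand sides use the old register values."""
    tap = [0, 0, 0, 0]            # tapped delay line, assumed cleared at start
    outputs = []
    for x in samples:
        y = (2 * tap[1] + tap[1] + trunc_div(tap[1], 2) + trunc_div(tap[1], 4)
             + 2 * tap[2] + tap[2] + trunc_div(tap[2], 2) + trunc_div(tap[2], 4)
             - tap[3] - tap[0])
        tap = [x, tap[0], tap[1], tap[2]]   # shift the delay line by one
        outputs.append(y)
    return outputs

response = fir_srg([10, 0, 0, 0, 0, 0])     # impulse of height 10
```

The model shows the one-cycle latency of the output register and the truncation of 3.75 × 10 = 37.5 to 37.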



There are three obvious actions that can improve this design: 

1) Realize each filter coefficient with an optimized CSD code (see Chap. 2, 
Example 2.1, p. 36). 

2) Increase effective multiplier speed by pipelining. The output adder
should be arranged in a pipelined balanced tree. If the coefficients are coded
as “powers-of-two,” the pipelined multiplier and the adder tree can be
merged. Pipelining has low overhead due to the fact that the LC registers
are otherwise often unused. A few additional pipeline registers may be
necessary if the number of terms in the tree to be added is not a power of
two.

3) For symmetric coefficients, the multiplication complexity can be reduced 
as shown in Fig. 3.5 (p. 116). 

The first two actions are applicable to all FIR filters, while the third applies 
only to linear-phase (symmetric) filters. These ideas will be illustrated by 
example designs.

Table 3.3. Improved FIR filter.

Symmetry     no      yes     no      no      yes     yes
CSD          no      no      yes     no      yes     yes
Tree         no      no      no      yes     no      yes
Speed/MHz    17.45   29.98   30.21   63.29   50.76   64.51
Size/LCs     97      74      63      113     57      65



Example 3.3: Improved Four-tap Direct FIR Filter 

The design from the previous example can be improved using a CSD code for
the coefficients 3.75 = 2^2 - 2^-2. In addition, symmetry and pipelining can
also be employed to enhance the filter’s performance. Table 3.3 shows the
maximum throughput that can be expected for each different design. CSD
coding and symmetry result in smaller, more compact designs. Improvements
in Registered Performance are obtained by pipelining the multiplier and
providing an adder tree for the output accumulation. Two additional pipeline
registers (i.e., 16 LCs) are necessary, however. The most compact design is
achieved using symmetry and CSD coding without the use of an adder tree.
The partial VHDL code for producing the filter output y is shown below.

    WAIT UNTIL clk = '1';
    t1 <= tap(1) + tap(2);      -- Using symmetry
    t2 <= tap(0) + tap(3);
    y <= 4 * t1 - t1 / 4 - t2;  -- Apply CSD code and add

The fastest design is obtained when all three enhancements are used. The
partial VHDL code, in this case, becomes:

    WAIT UNTIL clk = '1';
    t1 <= tap(1) + tap(2);  -- Use symmetry of coefficients
    t2 <= tap(0) + tap(3);
    t3 <= 4 * t1 - t1 / 4;  -- Pipelined multiplier
    t4 <= -t2;              -- Build a binary tree and add delay
    y <= t3 + t4;



| 3.3 | 
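The CSD recoding used in this example can be sketched in Python for integer constants. The routine below computes the non-adjacent form, which is one common way to obtain a CSD representation; the function names are hypothetical, and the adder cost is taken as the number of nonzero digits minus one.

```python
def csd_digits(n):
    """Signed digits (+1, 0, -1) of a non-negative integer, least significant first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n % 4)      # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def csd_adder_cost(n):
    """Adders needed to build n*x from shifted copies of x."""
    nonzero = sum(1 for d in csd_digits(n) if d != 0)
    return max(nonzero - 1, 0)

digits_15 = csd_digits(15)       # 3.75 scaled by 4, i.e. 16 - 1
```

The digits for 15 correspond to 16 - 1, which is the scaled form of 3.75 = 2^2 - 2^-2 and costs a single subtraction.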



Rephasing Pipelined Multiplier in FIR Filter 

Sometimes a single coefficient has more pipeline delay than all the other
coefficients. We can model this delay by f[n]z^{-d}. If we now add a positive
delay with

$$f[n] = z^{d} f[n] z^{-d} \tag{3.18}$$

the two delays are eliminated. Translating this into hardware means that for
the direct form FIR filter we have to use the output of the register located
d positions earlier in the delay line.



Fig. 3.10. Rephasing FIR filter. (a) Principle. (b) Rephasing a multiplier. (1)
Without pipelining. (2) With two-stage pipelining.



This principle is shown in Fig. 3.10a. Figure 3.10b shows an example of 
rephasing a pipelined multiplier that has two delays. 



3.4.2 FIR Filter with Transposed Structure 

A variation of the Direct FIR filter is called the transposed filter and has 
been discussed in Sec. 3.2.1 (p. 111). The transposed filter enjoys, in the case 
of a constant coefficient filter, the following two additional improvements, 
compared with the direct form FIR: 

• Multiple use of the repeated coefficients using the reduced adder graph 
(RAG) algorithm [27, 28, 29, 30] 

• Pipeline adders using a carry-save adder 

The pipeline adder will increase the speed, at additional adder and register 
costs, while the RAG principle will reduce the size (i.e., number of LCs) of 
the filter and sometimes also increase the speed. The pipeline adder principle 
has been discussed in Chap. 2 and here we will focus on the RAG algorithm. 

In Chap. 2 it was noted that it can sometimes be advantageous to imple- 
ment the factors of a constant coefficient, rather than implement the CSD 
code directly. For example, the CSD code realization of the constant multiplier
coefficient 93 requires three adders, while the factors 3 × 31 require only
two adders, see Fig. 2.2 (p. 38). For a transposed FIR filter, the probability
is high that all the coefficients will have several factors in common. For
instance, the coefficients 9 and 11 can be built using 8 + 1 = 9 for the first
and 11 = 9 + 2 for the second. This reduces the total effort by one adder.
In general, however, finding the optimal reduced adder graph (RAG) is an
NP-hard problem. As a result, heuristics must be used. Some obvious actions
can, however, reduce the effort. Specifically:

Algorithm 3.4: Reduced Adder Graph 

a) Remove the sign of the coefficient because the sign can be realized by 
a subtraction in the filter’s tapped delay line. 

b) Remove all coefficients and factors that are a power of two, since they 
can be implemented by a hardwired data shift. 

c) Realize all cost “1” coefficients.

d) Use cost “1” coefficients in building the multiplier of higher cost. 

Steps (a)-(c) are straightforward, but step (d) is potentially complex since 
the number of theoretical graphs increases exponentially. To simplify the 
process it is helpful to use the CSD coding data shown in Table 2.3 (p. 40). 
To illustrate the RAG algorithm, consider coding the coefficients defining the 
F6 half-band FIR filter of Goodman and Carey [68]. 

Example 3.5: Reduced Adder Graph for F6 Half-band Filter 

The half-band filter F6 has four nonzero coefficients, namely f[0], f[1], f[3],
and f[5], which are 346, 208, -44, and 9. For a first cost estimation we convert
the decimal values (index 10) into binary representations (index 2) and look
up the cost for the coefficients from Table 2.3 (p. 40). It follows that:

f[k]                                          Cost
f[0] = 346_10 = 2 × 173   = 101011010_2       4
f[1] = 208_10 = 2^4 × 13  = 11010000_2        2
f[3] = -44_10 = -2^2 × 11 = -101100_2         2
f[5] = 9_10   = 3^2       = 1001_2            1
Total                                         9

For the direct CSD code realization, 9 adders are required. The RAG algorithm
proceeds as follows:

Step  To be realized       Already realized  Action
0)    {346, 208, -44, 9}   {-}               Initialization
1)    {346, 208, 44, 9}    {-}               No negative coefficients
2)    {346, 208, 44, 9}    {-}               Remove 2^k coefficients
3)    {173, 13, 11, 9}     {-}               Remove 2^k factors from coefficients
4)    {173, 13, 11}        {9}               Realize cost 1 coefficients
5)    {173, 13, 11}        {9}               Other coefficients are primes

Apply the heuristic to the remaining coefficients, starting with the coefficient
with the lowest cost and smallest value. It follows that:

Step  Realize      Already realized    Action: find representation for
6)    {173, 13}    {9, 11}             11 = 9 + 2
7)    {173}        {9, 11, 13}         13 = 9 + 4
8)    {-}          {9, 11, 13, 173}    173 = (4 + 1) × 9 + 128

Figure 3.11 shows the resulting reduced adder graph. The number of adders
is reduced from 9 to 5. The adder path delay is also reduced from 4 to 3. | 3.5 |
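The representations found in steps 4 to 8 can be checked with a short Python model of the multiplier block. This is a sketch under the assumption that every shift is free wiring and that each `+` below corresponds to one hardware adder; the function name is hypothetical.

```python
def f6_multiplier_block(x):
    """Multiply x by 346, 208, 44, and 9 using five adders and wired shifts."""
    n9 = (x << 3) + x             # adder 1:   9 = 8 + 1
    n11 = n9 + (x << 1)           # adder 2:  11 = 9 + 2
    n13 = n9 + (x << 2)           # adder 3:  13 = 9 + 4
    n45 = (n9 << 2) + n9          # adder 4:  45 = (4 + 1) * 9
    n173 = n45 + (x << 7)         # adder 5: 173 = 45 + 128
    # The power-of-two factors removed in step 3 return as output shifts.
    return {346: n173 << 1, 208: n13 << 4, 44: n11 << 2, 9: n9}
```

The longest chain runs through n9, n45, and n173, which is the adder path delay of 3 stated above. The negative sign of the coefficient -44 is not part of the block; as in step (a) of the algorithm, it is handled by a subtraction in the tapped delay line.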



3.4.3 FIR Filter Using Distributed Arithmetic 

A completely different FIR architecture is based on the distributed arithmetic 
(DA) concept introduced in Sect. 2.7.1 (p. 88). In contrast to a conventional 
sum-of-products architecture, in distributed arithmetic we always compute 
the sum of products of a specific bit b over all coefficients in one step. This 
is computed using a small table and an accumulator with shifter. 
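The principle can be illustrated with a small Python sketch for unsigned inputs. The function names are hypothetical, and the accumulation is written here with left shifts for clarity; the hardware examples that follow use a right-shifting accumulator instead.

```python
def da_table(coefficients):
    """Entry a holds the sum of all coefficients whose address bit is set."""
    n = len(coefficients)
    return [sum(c for k, c in enumerate(coefficients) if (a >> k) & 1)
            for a in range(2 ** n)]


def da_inner_product(coefficients, samples, bits):
    """Unsigned inner product via one table look-up per input bit plane."""
    table = da_table(coefficients)
    acc = 0
    for b in range(bits):
        # Address: bit b of every sample, with sample k in address bit k.
        address = sum(((x >> b) & 1) << k for k, x in enumerate(samples))
        acc += table[address] << b     # weight the partial sum by 2^b
    return acc
```

For the coefficients {2, 3, 1} the table has the eight entries 0, 2, 3, 5, 1, 3, 4, 6, and a sum of products for 3-bit inputs needs three look-ups instead of three multiplications.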

To illustrate, consider the three-coefficient FIR filter with coefficients
{2, 3, 1} from Example 2.23 (p. 90).

Example 3.6: Distributed Arithmetic Filter as State Machine 

A distributed arithmetic filter can be built in VHDL code (footnote 4) using
the following state machine description:

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

PACKAGE da_package IS       -- User defined component
  COMPONENT case3
    PORT ( table_in  : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
           table_out : OUT INTEGER RANGE 0 TO 6);
  END COMPONENT;
END da_package;

LIBRARY work;
USE work.da_package.ALL;

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY dafsm IS                      ------> Interface
  PORT (clk                 : IN  STD_LOGIC;
        x_in0, x_in1, x_in2 : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
        y                   : OUT INTEGER RANGE 0 TO 63);
END dafsm;

ARCHITECTURE flex OF dafsm IS
  TYPE STATE_TYPE IS (s0, s1);
  SIGNAL state                : STATE_TYPE;
  SIGNAL x0, x1, x2, table_in : STD_LOGIC_VECTOR(2 DOWNTO 0);
  SIGNAL table_out            : INTEGER RANGE 0 TO 7;
BEGIN

  table_in(0) <= x0(0);
  table_in(1) <= x1(0);
  table_in(2) <= x2(0);

  PROCESS                   ------> DA in behavioral style
    VARIABLE p     : INTEGER RANGE 0 TO 63;  -- Temp. register
    VARIABLE count : INTEGER RANGE 0 TO 3;   -- Counts shifts
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN s0 =>            -- Initialization step
        state <= s1;
        count := 0;
        p := 0;
        x0 <= x_in0;
        x1 <= x_in1;
        x2 <= x_in2;
      WHEN s1 =>            -- Processing step
        IF count = 3 THEN   -- Is sum of product done ?
          y <= p;           -- Output of result to y and
          state <= s0;      -- start next sum of product
        ELSE
          p := p / 2 + table_out * 4;
          x0(0) <= x0(1);
          x0(1) <= x0(2);
          x1(0) <= x1(1);
          x1(1) <= x1(2);
          x2(0) <= x2(1);
          x2(1) <= x2(2);
          count := count + 1;
          state <= s1;
        END IF;
    END CASE;
  END PROCESS;

  LC_Table0: case3
    PORT MAP(table_in => table_in, table_out => table_out);

END flex;

Footnote 4: The equivalent Verilog code dafsm.v for this example can be found
in Appendix A on page 454.

The LC table (footnote 5), defined as a CASE component, was generated with the
utility program dagen.exe. The output is shown below.

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY case3 IS
  PORT ( table_in  : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
         table_out : OUT INTEGER RANGE 0 TO 6);
END case3;

ARCHITECTURE LCs OF case3 IS
BEGIN

-- This is the DA CASE table for
-- the 3 coefficients: 2, 3, 1
-- automatically generated with dagen.exe -- DO NOT EDIT!

  PROCESS (table_in)
  BEGIN
    CASE table_in IS
      WHEN "000"  => table_out <= 0;
      WHEN "001"  => table_out <= 2;
      WHEN "010"  => table_out <= 3;
      WHEN "011"  => table_out <= 5;
      WHEN "100"  => table_out <= 1;
      WHEN "101"  => table_out <= 3;
      WHEN "110"  => table_out <= 4;
      WHEN "111"  => table_out <= 6;
      WHEN OTHERS => table_out <= 0;
    END CASE;
  END PROCESS;
END LCs;

As suggested in Chap. 2, a shift/accumulator is used, which shifts only one
position to the right for each step, instead of shifting k positions to the
left. The simulation results, shown in Fig. 3.12, report the correct result
(y = 18) for an input sequence {1, 3, 7}. The design runs with a Registered
Performance of 56.17 MHz and uses 37 LCs and no EABs. | 3.6 |
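The right-shifting accumulator can be traced with a small Python model of the processing state. This is a behavioral sketch only, assuming 3-bit unsigned inputs; the function name is hypothetical and clock-level details such as the output register are left out.

```python
CASE3 = [0, 2, 3, 5, 1, 3, 4, 6]   # CASE table for the coefficients 2, 3, 1


def dafsm_model(x_in0, x_in1, x_in2, bits=3):
    """Mimic state s1 of dafsm: one look-up and one accumulation per clock."""
    x0, x1, x2 = x_in0, x_in1, x_in2      # state s0: load the input registers
    p = 0
    trace = []
    for _ in range(bits):
        table_in = (x0 & 1) | ((x1 & 1) << 1) | ((x2 & 1) << 2)
        # Corresponds to p := p / 2 + table_out * 4 for bits = 3.
        p = p // 2 + CASE3[table_in] * 2 ** (bits - 1)
        x0, x1, x2 = x0 >> 1, x1 >> 1, x2 >> 1   # next bit moves to the LSB
        trace.append(p)
    return p, trace
```

For the inputs 1, 3, and 7 the accumulator takes the values 24, 28, and 18, so the final value matches the simulated output y = 18. Each new table value enters with the weight of the most significant bit, and earlier contributions are halved once per step until they reach their final weights.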

Footnote 5: The equivalent Verilog code case3.v for this example can be found
in Appendix A on page 455.

Fig. 3.12. Simulation of the 3-tap FIR filter with input {1, 3, 7}.



By defining the distributed arithmetic table with a CASE statement, the
synthesizer will use logic cells to implement the LUT. This will result in a fast
and efficient design only if the tables are small. For large tables, alternative
means must be found. In this case, we may use the 2-kbit embedded array
blocks (EABs), which (as discussed in Chap. 1) can be configured as 2^8 × 8,
2^9 × 4, 2^10 × 2, or 2^11 × 1 tables. These two design paths are discussed in
more detail in the following.



Distributed Arithmetic Using Logic Cells 

The DA implementation of an FIR filter is particularly attractive for
low-order cases due to LUT address space limitations (e.g., L ≤ 4). It should
be remembered, however, that FIR filters are linear filters. This implies that
the outputs of a collection of low-order filters can be added together to
define the output of a high-order FIR, as shown in Fig. 2.33 (p. 94). Based
on the LCs found in a FLEX10K device, namely 2^4 × 1-bit tables, a DA
table for four coefficients can be implemented. The number of necessary LCs
increases exponentially with order. Typically, the number of LCs is much
higher than the number of EABs. For example, an EPF10K70RC240-4 contains
3744 LCs but only 9 EABs. Also, EABs can be used to efficiently implement
RAMs, FIFOs, and other high-valued functions. It is therefore
sometimes desirable to use EABs economically. Using EABs in an unregistered
mode will decrease the maximal bandwidth of the design. If the design
is implemented using larger tables with a 2^b × b CASE statement, inefficient
designs can result. Even choosing "Global Project Logic Synthesis" with
the "Optimize Area" option, with "Reduce Logic" and "Duplicate Logic
Extraction" enabled, which gives optimal area for the CASE table, will still
result in a larger-than-expected design. The 2^8 × 8 table implemented with
one VHDL CASE statement only, for example, required 587 LCs. Figure 3.13
shows the number of LCs necessary for tables having three to eight inputs
using the CASE statement. The minimum size curve is obtained from the fact
that in the FLEX10K a 4 → 1 multiplexer can (using the cascade AND gate)
be built with 2 LCs. A 16 → 1 multiplexer can therefore be built with 10 LCs
(see help for busmux in MaxPlusII).

Fig. 3.13. Size comparison of synthesis results for different coding using the
CASE statement with b inputs and outputs.

Synthesizers typically try to optimize the logic equations and are not
capable of recognizing structures. It is generally more efficient to realize the
4-input table with CASE statements, followed by a (bus) multiplexer. In this
model it is also straightforward to add additional pipeline registers to the
modular design. For maximum speed, a register must be introduced behind
each 2 → 1 multiplexer. This will, however, yield a higher LC count compared
to the 2-LC implementation of a 4 → 1 multiplexer. The following example
illustrates the structure of a 5-input table.

Example 3.7: Five-input DA Table 

The utility program dagen.exe accepts filter length and coefficients, and
returns the necessary PROCESS statements for the 4-input CASE table followed by
a multiplexer. The VHDL output for an arbitrary set of coefficients, namely
{1, 3, 5, 7, 9}, is given (footnote 6) in the following listing:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY case5p IS
  PORT ( clk       : IN  STD_LOGIC;
         table_in  : IN  STD_LOGIC_VECTOR(4 DOWNTO 0);
         table_out : OUT INTEGER RANGE 0 TO 25);
END case5p;

ARCHITECTURE LCs OF case5p IS
  SIGNAL lsbs  : STD_LOGIC_VECTOR(3 DOWNTO 0);
  SIGNAL msbs0 : STD_LOGIC_VECTOR(1 DOWNTO 0);
  SIGNAL table0out00, table0out01 : INTEGER RANGE 0 TO 25;
BEGIN

-- These are the distributed arithmetic CASE tables for
-- the 5 coefficients: 1, 3, 5, 7, 9
-- automatically generated with dagen.exe -- DO NOT EDIT!

  PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    lsbs(0) <= table_in(0);
    lsbs(1) <= table_in(1);
    lsbs(2) <= table_in(2);
    lsbs(3) <= table_in(3);
    msbs0(0) <= table_in(4);
    msbs0(1) <= msbs0(0);
  END PROCESS;

  PROCESS               -- This is the final DA MPX stage.
  BEGIN                 -- Automatically generated with dagen.exe
    WAIT UNTIL clk = '1';
    CASE msbs0(1) IS
      WHEN '0'    => table_out <= table0out00;
      WHEN '1'    => table_out <= table0out01;
      WHEN OTHERS => table_out <= 0;
    END CASE;
  END PROCESS;

  PROCESS               -- This is the DA CASE table 00 out of 1.
  BEGIN                 -- Automatically generated with dagen.exe
    WAIT UNTIL clk = '1';
    CASE lsbs IS
      WHEN "0000" => table0out00 <= 0;
      WHEN "0001" => table0out00 <= 1;
      WHEN "0010" => table0out00 <= 3;
      WHEN "0011" => table0out00 <= 4;
      WHEN "0100" => table0out00 <= 5;
      WHEN "0101" => table0out00 <= 6;
      WHEN "0110" => table0out00 <= 8;
      WHEN "0111" => table0out00 <= 9;
      WHEN "1000" => table0out00 <= 7;
      WHEN "1001" => table0out00 <= 8;
      WHEN "1010" => table0out00 <= 10;
      WHEN "1011" => table0out00 <= 11;
      WHEN "1100" => table0out00 <= 12;
      WHEN "1101" => table0out00 <= 13;
      WHEN "1110" => table0out00 <= 15;
      WHEN "1111" => table0out00 <= 16;
      WHEN OTHERS => table0out00 <= 0;
    END CASE;
  END PROCESS;

  PROCESS               -- This is the DA CASE table 01 out of 1.
  BEGIN                 -- Automatically generated with dagen.exe
    WAIT UNTIL clk = '1';
    CASE lsbs IS
      WHEN "0000" => table0out01 <= 9;
      WHEN "0001" => table0out01 <= 10;
      WHEN "0010" => table0out01 <= 12;
      WHEN "0011" => table0out01 <= 13;
      WHEN "0100" => table0out01 <= 14;
      WHEN "0101" => table0out01 <= 15;
      WHEN "0110" => table0out01 <= 17;
      WHEN "0111" => table0out01 <= 18;
      WHEN "1000" => table0out01 <= 16;
      WHEN "1001" => table0out01 <= 17;
      WHEN "1010" => table0out01 <= 19;
      WHEN "1011" => table0out01 <= 20;
      WHEN "1100" => table0out01 <= 21;
      WHEN "1101" => table0out01 <= 22;
      WHEN "1110" => table0out01 <= 24;
      WHEN "1111" => table0out01 <= 25;
      WHEN OTHERS => table0out01 <= 0;
    END CASE;
  END PROCESS;

END LCs;

Footnote 6: The equivalent Verilog code case5p.v for this example can be found
in Appendix A on page 456.

The five inputs produce two CASE tables and a 2 → 1 bus multiplexer. The
multiplexer may also be realized with a component instantiation using the
LPM function busmux. The program dagen.exe writes a VHDL file with
the name caseX.vhd, where X is the filter length, which is also the input bit
width. The file caseXp.vhd is the same table, except with additional pipeline
registers. The component can be used directly in a state machine design or
in an unrolled filter structure. | 3.7 |
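The split into 4-input tables plus a multiplexer can be reproduced with a short Python sketch. It is not a reimplementation of dagen.exe; the function names are hypothetical, and the only assumption is that the four least significant address bits feed the CASE tables while the remaining bits select between them.

```python
def da_table(coefficients):
    """Full DA table: entry a sums the coefficients selected by the bits of a."""
    n = len(coefficients)
    return [sum(c for k, c in enumerate(coefficients) if (a >> k) & 1)
            for a in range(2 ** n)]


def split_tables(coefficients, lut_inputs=4):
    """Return one lut_inputs-wide table per combination of the remaining MSBs."""
    low, high = coefficients[:lut_inputs], coefficients[lut_inputs:]
    base = da_table(low)
    offsets = da_table(high)        # constant added for each MSB pattern
    return [[entry + off for entry in base] for off in offsets]


def lookup(tables, address, lut_inputs=4):
    """Model of the CASE tables followed by the bus multiplexer."""
    lsbs = address & (2 ** lut_inputs - 1)   # address of the CASE tables
    msbs = address >> lut_inputs             # select input of the multiplexer
    return tables[msbs][lsbs]
```

For the coefficients {1, 3, 5, 7, 9} this yields two 16-entry tables that agree with the two CASE tables of the listing; the second table is the first one with the fifth coefficient, 9, added to every entry.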



Referring to Fig. 3.13, it can be seen that the structured VHDL code
reduces the number of required LCs. Figure 3.14 compares the different
design methods in terms of speed. Reducing the LC count also improves the
throughput, because the number of LC levels is reduced. Although seven
pipeline stages give a 2^8 × 8 table a high Registered Performance of
58.82 MHz, the design may now be too large for some applications. We
may also consider the partitioning technique (Exercise 3.6, p. 144), shown in
Fig. 2.32 (p. 93), or implementation with an EAB, discussed next.

DA Using Embedded Array Blocks 

As mentioned in the last section, it is not economical to use the 2-kbit EABs
for a short FIR filter, mainly because the number of available EABs is
limited. Also, the maximum registered speed of an EAB is 76 MHz, and an LC
table implementation may be faster. The following example shows the DA
implementation using a component instantiation of the EAB.

Example 3.8: Distributed Arithmetic Filter using EABs 

The CASE table from the last example can be replaced by an EAB ROM. The
ROM table is defined by the file darom3.mif. The default input and output
configuration of the EAB is "REGISTERED." If it is not desirable to have a
registered configuration, set LPM_ADDRESS_CONTROL => "UNREGISTERED" and
LPM_OUTDATA => "UNREGISTERED". The VHDL code (footnote 7) for the state
machine design is shown below:

LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;   -- Contains conversion
                                   -- VECTOR -> INTEGER

ENTITY darom IS                      ------> Interface
  PORT (clk                 : IN  STD_LOGIC;
        x_in0, x_in1, x_in2 : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
        y                   : OUT INTEGER RANGE 0 TO 63);
END darom;

ARCHITECTURE flex OF darom IS
  TYPE STATE_TYPE IS (s0, s1);
  SIGNAL state     : STATE_TYPE;
  SIGNAL x0, x1, x2, table_in, mem
                   : STD_LOGIC_VECTOR(2 DOWNTO 0);
  SIGNAL table_out : INTEGER RANGE 0 TO 7;
BEGIN

  table_in(0) <= x0(0);
  table_in(1) <= x1(0);
  table_in(2) <= x2(0);

  PROCESS                   ------> DA in behavioral style
    VARIABLE p     : INTEGER RANGE 0 TO 63;  -- Temp. register
    VARIABLE count : INTEGER RANGE 0 TO 3;   -- Counts the shifts
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN s0 =>            -- Initialization step
        state <= s1;
        count := 0;
        p := 0;
        x0 <= x_in0;
        x1 <= x_in1;
        x2 <= x_in2;
      WHEN s1 =>            -- Processing step
        IF count = 3 THEN   -- Is sum of product done ?
          y <= p;           -- Output of result to y and
          state <= s0;      -- start next sum of product
        ELSE
          p := p / 2 + table_out * 4;
          x0(0) <= x0(1);
          x0(1) <= x0(2);
          x1(0) <= x1(1);
          x1(1) <= x1(2);
          x2(0) <= x2(1);
          x2(1) <= x2(2);
          count := count + 1;
          state <= s1;
        END IF;
    END CASE;
  END PROCESS;

  rom_1: lpm_rom
    GENERIC MAP ( LPM_WIDTH => 3,
                  LPM_WIDTHAD => 3,
                  LPM_OUTDATA => "UNREGISTERED",
                  LPM_ADDRESS_CONTROL => "UNREGISTERED",
                  LPM_FILE => "darom3.mif")
    PORT MAP ( address => table_in, q => mem);

  table_out <= CONV_INTEGER(mem);

END flex;

Footnote 7: The equivalent Verilog code darom.v for this example can be found
in Appendix A on page 457.

Compared with Example 3.6 (p. 128), we now have a component instantiation
of the LPM_ROM. Because there is a need to convert between the
STD_LOGIC_VECTOR output of the ROM and the integer, we have used the
package std_logic_unsigned from the library ieee. The latter contains the
CONV_INTEGER function for unsigned STD_LOGIC_VECTOR.
The include file darom3.mif was also generated with the program dagen.exe.
The file has the following contents:

-- This is the DA MIF table for the 3 coefficients: 2, 3, 1
-- automatically generated with dagen.exe -- DO NOT EDIT!
WIDTH = 3;
DEPTH = 8;
ADDRESS_RADIX = dec;
DATA_RADIX = dec;
CONTENT BEGIN
0 : 0;
1 : 2;
2 : 3;
3 : 5;
4 : 1;
5 : 3;
6 : 4;
7 : 6;
END;

The design runs with 28.01 MHz and uses 34 LCs (three fewer than the CASE
version) and one EAB (more precisely, 3/8 of an EAB). | 3.8 |
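A memory initialization file of this form is easy to produce for other coefficient sets. The following Python sketch is a hypothetical stand-in, not the dagen.exe program; it assumes non-negative coefficients and simply mirrors the keywords and layout of the darom3.mif listing above.

```python
def da_mif(coefficients):
    """Return MIF text for the DA table of non-negative coefficients."""
    n = len(coefficients)
    table = [sum(c for k, c in enumerate(coefficients) if (a >> k) & 1)
             for a in range(2 ** n)]
    width = max(max(table).bit_length(), 1)   # bits for the largest entry
    lines = [
        f"WIDTH = {width};",
        f"DEPTH = {len(table)};",
        "ADDRESS_RADIX = dec;",
        "DATA_RADIX = dec;",
        "CONTENT BEGIN",
    ]
    lines += [f"{a} : {v};" for a, v in enumerate(table)]
    lines.append("END;")
    return "\n".join(lines)
```

Calling `da_mif([2, 3, 1])` reproduces the width, depth, and the eight content lines of the file shown above, without the two comment lines.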



EABs, however, have only a single address decoder. If we implement a 2^4 × 8
table, a complete EAB is consumed unnecessarily, and it cannot be
used elsewhere. For longer filters, however, the use of EABs is attractive
because:

• EABs have registered throughput at a constant 76 MHz, and

• Routing effort is reduced.



Signed DA FIR Filter 

A signed DA filter will require a signed accumulator. The following example
shows the VHDL code for the previously studied three-coefficient example,
Example 2.24 from Chap. 2 (p. 92).

Example 3.9: Signed DA FIR Filter 

For the signed DA filter, an additional state is required. See the variable
count (footnote 8), which is used to process the sign bit.

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

PACKAGE da_package IS       -- User defined components
  COMPONENT case3s
    PORT ( table_in  : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
           table_out : OUT INTEGER RANGE -2 TO 4);
  END COMPONENT;
END da_package;

LIBRARY work;
USE work.da_package.ALL;

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY dasign IS                     ------> Interface
  PORT (clk                 : IN  STD_LOGIC;
        x_in0, x_in1, x_in2 : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
        y                   : OUT INTEGER RANGE -64 TO 63);
END dasign;

ARCHITECTURE flex OF dasign IS
  TYPE STATE_TYPE IS (s0, s1);
  SIGNAL state      : STATE_TYPE;
  SIGNAL table_in   : STD_LOGIC_VECTOR(2 DOWNTO 0);
  SIGNAL x0, x1, x2 : STD_LOGIC_VECTOR(3 DOWNTO 0);
  SIGNAL table_out  : INTEGER RANGE -2 TO 4;
BEGIN

  table_in(0) <= x0(0);
  table_in(1) <= x1(0);
  table_in(2) <= x2(0);

  PROCESS                   ------> DA in behavioral style
    VARIABLE p     : INTEGER RANGE -64 TO 63; -- Temporary reg.
    VARIABLE count : INTEGER RANGE 0 TO 4;    -- Counts the shifts
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN s0 =>            -- Initialization step
        state <= s1;
        count := 0;
        p := 0;
        x0 <= x_in0;
        x1 <= x_in1;
        x2 <= x_in2;
      WHEN s1 =>            -- Processing step
        IF count = 4 THEN   -- Is sum of product done?
          y <= p;           -- Output of result to y and
          state <= s0;      -- start next sum of product
        ELSE
          IF count = 3 THEN              -- Subtract for last
            p := p / 2 - table_out * 8;  -- accumulator step
          ELSE
            p := p / 2 + table_out * 8;  -- Accumulation for
          END IF;                        -- all other steps
          FOR k IN 0 TO 2 LOOP           -- Shift bits
            x0(k) <= x0(k+1);
            x1(k) <= x1(k+1);
            x2(k) <= x2(k+1);
          END LOOP;
          count := count + 1;
          state <= s1;
        END IF;
    END CASE;
  END PROCESS;

  LC_Table0: case3s
    PORT MAP(table_in => table_in, table_out => table_out);

END flex;

Footnote 8: The equivalent Verilog code case3s.v for this example can be found
in Appendix A on page 459.

The LC table (component case3s.vhd) was generated using the program
dagen.exe. The VHDL code (footnote 9) is shown below:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY case3s IS
  PORT ( table_in  : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
         table_out : OUT INTEGER RANGE -2 TO 4);
END case3s;

ARCHITECTURE LCs OF case3s IS
BEGIN

-- This is the DA CASE table for
-- the 3 coefficients: -2, 3, 1
-- automatically generated with dagen.exe -- DO NOT EDIT!

  PROCESS (table_in)
  BEGIN
    CASE table_in IS
      WHEN "000"  => table_out <= 0;
      WHEN "001"  => table_out <= -2;
      WHEN "010"  => table_out <= 3;
      WHEN "011"  => table_out <= 1;
      WHEN "100"  => table_out <= 1;
      WHEN "101"  => table_out <= -1;
      WHEN "110"  => table_out <= 4;
      WHEN "111"  => table_out <= 2;
      WHEN OTHERS => table_out <= 0;
    END CASE;
  END PROCESS;
END LCs;

Footnote 9: The equivalent Verilog code case3s.v for this example can be found
in Appendix A on page 460.

Fig. 3.15. Simulation of the 3-tap signed FIR filter with input {1, -3, 7}.

Figure 3.15 shows the simulation for the input sequence {1, -3, 7}. As
expected, the output y is -4_10 = 1111100_2C. The design uses 65 LCs and runs
with 33.44 MHz Registered Performance. | 3.9 |
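The role of the extra subtraction step can be seen in a small Python model of the signed accumulation. This is a behavioral sketch with a hypothetical function name, assuming 4-bit two's complement inputs as in the listing; it does not model the clocking or the output register.

```python
CASE3S = [0, -2, 3, 1, 1, -1, 4, 2]   # CASE table for the coefficients -2, 3, 1


def dasign_model(samples, bits=4):
    """Bit-serial signed DA: add for the lower bits, subtract for the sign bit."""
    mask = 2 ** bits - 1
    regs = [x & mask for x in samples]      # two's complement bit patterns
    p = 0
    for count in range(bits):
        table_in = sum((r & 1) << k for k, r in enumerate(regs))
        term = CASE3S[table_in] * 2 ** (bits - 1)
        # Every intermediate p is even, so the halving is exact here.
        if count == bits - 1:
            p = p // 2 - term               # sign bit carries the weight -2^(bits-1)
        else:
            p = p // 2 + term
        regs = [r >> 1 for r in regs]
    return p
```

For the inputs 1, -3, and 7 the four table values are 2, 1, 4, and 3, and the result is 2 + 2 × 1 + 4 × 4 - 8 × 3 = -4, in agreement with the simulation.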



To accelerate a DA filter, unrolled loops can be used. The input is applied
sample by sample (one word at a time), in a bit-parallel form. In this case,
a separate table is required for each bit of the input. While the table size
varies with the filter (the table input bit width equals the number of filter
taps), the contents of the tables are the same. The obvious advantage is a
reduction of VHDL code size, if we use a component definition for the LC
tables, as previously presented. To demonstrate, the unrolling of the
3-coefficient, 4-bit input example previously considered is developed below.

Example 3.10: Loop Unrolling for DA FIR Filter 

In a typical FIR application, the input values are processed in word-parallel
form (see Fig. 3.16). The following VHDL code (footnote 3) illustrates the
unrolled DA code, according to Fig. 3.16.

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

PACKAGE da_package IS       -- User defined components
  COMPONENT case3s
    PORT ( table_in  : IN  STD_LOGIC_VECTOR(2 DOWNTO 0);
           table_out : OUT INTEGER RANGE -2 TO 4);
  END COMPONENT;
END da_package;

LIBRARY work;
USE work.da_package.ALL;

LIBRARY ieee;               -- Using predefined packages
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY dapara IS                     ------> Interface
  PORT (clk  : IN  STD_LOGIC;
        x_in : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
        y    : OUT INTEGER RANGE -46 TO 44);
END dapara;

ARCHITECTURE flex OF dapara IS
  SIGNAL x0, x1, x2, x3 : STD_LOGIC_VECTOR(2 DOWNTO 0);
  SIGNAL y0, y1, y2, y3 : INTEGER RANGE -2 TO 4;
  SIGNAL s0 : INTEGER RANGE -6 TO 12;
  SIGNAL s1 : INTEGER RANGE -10 TO 8;
  SIGNAL t0, t1, t2, t3 : INTEGER RANGE -2 TO 4;
BEGIN

  PROCESS                   ------> DA in behavioral style
  BEGIN
    WAIT UNTIL clk = '1';
    FOR k IN 0 TO 1 LOOP    -- Shift all four bits
      x0(k) <= x0(k+1);
      x1(k) <= x1(k+1);
      x2(k) <= x2(k+1);
      x3(k) <= x3(k+1);
    END LOOP;
    x0(2) <= x_in(0);       -- Load x_in in the
    x1(2) <= x_in(1);       -- MSBs of register 2
    x2(2) <= x_in(2);
    x3(2) <= x_in(3);
    y <= y0 + 2 * y1 + 4 * y2 - 8 * y3;
-- Pipeline register and adder tree
--  t0 <= y0; t1 <= y1; t2 <= y2; t3 <= y3;
--  s0 <= t0 + 2 * t1; s1 <= t2 - 2 * t3;
--  y <= s0 + 4 * s1;
  END PROCESS;

  LC_Table0: case3s
    PORT MAP(table_in => x0, table_out => y0);
  LC_Table1: case3s
    PORT MAP(table_in => x1, table_out => y1);
  LC_Table2: case3s
    PORT MAP(table_in => x2, table_out => y2);
  LC_Table3: case3s
    PORT MAP(table_in => x3, table_out => y3);

END flex;

Footnote 3: The equivalent Verilog code dapara.v for this example can be found
in Appendix A on page 461.

Fig. 3.16. Parallel implementation of a distributed arithmetic FIR filter.

The design uses four tables of size 2^3 × 4, and all tables have the same
content as the table shown in Example 3.9 (p. 137). Figure 3.17 shows the
simulation for the input sequence {1, -3, 7}. Because the input is applied
serially (word by word, and bit-parallel), the expected result
-4_10 = 1111100_2C is computed at the 400-ns interval. | 3.10 |



The previous design requires 39 LCs and runs at 31.84 MHz. An important
advantage of the DA concept, compared with a general-purpose MAC design,
is that pipelining is easily achieved. We can add additional pipeline registers
at the table output and at the adder-tree output at no cost. That is, to
compute y, instead of

  y <= y0 + 2 * y1 + 4 * y2 - 8 * y3;

we use the signals t0 to t3, s0, and s1 for the pipelined version within a
PROCESS statement:

  t0 <= y0; t1 <= y1; t2 <= y2; t3 <= y3;
  s0 <= t0 + 2 * t1; s1 <= t2 - 2 * t3;
  y <= s0 + 4 * s1;

The size of the design does not increase, because the registers of the LC table
and adder are otherwise unused. But the Registered Performance increases
from 31.84 MHz to 83.33 MHz!
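That the pipelined adder tree only delays the result, without changing it, can be checked with a cycle-based Python model. This is a sketch under the assumption that all registers start at zero; the function name is hypothetical, and each loop iteration stands for one rising clock edge.

```python
CASE3S = [0, -2, 3, 1, 1, -1, 4, 2]   # case3s table for coefficients -2, 3, 1


def dapara_model(samples, pipelined=False):
    """Return the registered output y after each clock edge."""
    x = [0, 0, 0, 0]      # x[b]: 3-bit shift register holding bit b of 3 samples
    t = [0, 0, 0, 0]      # pipeline registers behind the tables
    s0 = s1 = 0           # pipeline registers inside the adder tree
    outputs = []
    for sample in samples:
        tab = [CASE3S[r] for r in x]          # table outputs y0..y3
        if pipelined:
            y = s0 + 4 * s1                   # uses the old s0, s1
            s0, s1 = t[0] + 2 * t[1], t[2] - 2 * t[3]   # uses the old t0..t3
            t = tab
        else:
            y = tab[0] + 2 * tab[1] + 4 * tab[2] - 8 * tab[3]
        word = sample & 0xF                   # 4-bit two's complement pattern
        x = [(r >> 1) | (((word >> b) & 1) << 2) for b, r in enumerate(x)]
        outputs.append(y)
    return outputs
```

With the input sequence 1, -3, 7 followed by zeros, the unpipelined model delivers -4 at the fourth clock edge, and the pipelined model delivers the same output sequence two clock edges later.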



Exercises 

3.1: A filter has the following specification: sampling frequency 2 kHz; passband
0-0.4 kHz, stopband 0.5-1 kHz; passband ripple 3 dB, and stopband ripple 48 dB.
Use the MatLab software and the "Interactive Lowpass Filter Design" demo from
the Signal Processing Toolbox for the filter design.

(a1) Design a direct filter with a Kaiser window.

(a2) Determine filter length and the absolute ripple in the passband.

(b1) Design an equiripple filter (called REMEX).

(b2) Determine filter length and the absolute ripple in the passband.



Exercises Using MaxPlusII 

3.2: (a) Compute the RAG for a length-11 half-band filter F5 that has the nonzero
coefficients f[0] = 256, f[±1] = 150, f[±3] = -25, f[±5] = 3.

(b) What is the minimum output bit width of the filter, if the input bit width is 8
bits?

(c1) Write and compile (with the MaxPlusII compiler) the VHDL code for the
filter.

(c2) Simulate the filter with impulse and step response.

(d) Write the VHDL code for the filter in distributed arithmetic, using the state
machine approach with the table realized as LPM_ROM.



3.3: (a) Compute the RAG for a length-11 half-band filter F7 that has the nonzero
coefficients f[0] = 512, f[±1] = 302, f[±3] = -53, f[±5] = 7.

(b) What is the minimum output bit width of the filter, if the input bit width is 8
bits?

(c1) Write and compile (with the MaxPlusII compiler) the VHDL code for the
filter.

(c2) Simulate the filter with impulse and step response.



3.4: Hartley [69] has introduced a concept to implement constant coefficient filters,
by exploiting common subexpressions across coefficients. For instance, consider
the filter

    y[n] = \sum_{k=0}^{L-1} a[k] x[n-k],                          (3.19)

with the three coefficients a[k] = {480, -302, 31}. The CSD code of these three
coefficients is given by

  a[k]    512  256  128   64   32   16    8    4    2    1
   480      1    0    0    0   -1    0    0    0    0    0
  -302      0   -1    0   -1    0    1    0    0    1    0
    31      0    0    0    0    1    0    0    0    0   -1

From the table we note that the pattern

     1    0
     0   -1

can be found four times (twice with inverted signs). If we therefore build the
temporary variable h[n] = 2x[n] - x[n-1], we can compute the filter output with

    y[n] = 256 h[n] - 16 h[n] - 32 h[n-1] + h[n-1].               (3.20)

(a) Verify (3.20) by substituting h[n] = 2x[n] - x[n-1].

(b) How many adders are required to yield the direct CSD implementation of (3.19)
and the implementation with subexpression sharing?

(c1) Implement the filter with subexpression sharing with MaxPlusII for 8-bit
inputs.

(c2) Simulate the impulse response of the filter.

(c3) Determine LC usage and Registered Performance.



3.5: Use the subexpression method from Exercise 3.4 to implement a 4-tap filter
with the coefficients a[k] = {-1406, -1109, -894, 2072}.

(a) Find the CSD code and the subexpression representation for the most frequent
pattern.

(b) Substitute for the subexpression a 2 or -2, respectively. Apply the
subexpression sharing one more time to the reduced set.

(c) Determine the temporary equations and check by substitution back into (3.19).

(d) How many adders are required to yield the direct CSD implementation of (3.19)
and the implementation with subexpression sharing?

(e1) Implement the filter with subexpression sharing with MaxPlusII for 8-bit
inputs.

(e2) Simulate the impulse response of the filter.

(e3) Determine LC usage and Registered Performance.



3.6: (a1) Use the program dagen.exe to compile a DA table for the coefficients
{20, 24, 21, 100, 13, 11, 19, 7} using multiple CASE statements.
Synthesize the design for maximum speed and determine the size and Registered
Performance.

(a2) Simulate the design using power-of-two input values 2^k, 0 ≤ k ≤ 7.

(b1) Use the partitioning technique to implement the same table using two sets,
namely {20, 24, 21, 100} and {13, 11, 19, 7}, and an additional adder. Synthesize the
design for maximum speed and determine the size and Registered Performance.

(b2) Simulate the design using power-of-two input values 2^k, 0 ≤ k ≤ 7.

(c) Compare the designs from (a) and (b).




4. Infinite Impulse Response (IIR) Digital
Filters



Introduction 

In Chap. 3 we introduced the FIR filter. The most important properties that 
make the FIR attractive (+) or unattractive (— ) for selective applications 
include: 

+ FIR linear-phase performance is easily achieved. 

+ Multiband filters are possible. 

+ The Kaiser window method allows iterative-free design. 

+ FIRs have a simple structure for decimators and interpolators (see 
Chap. 5). 

+ Nonrecursive filters are always stable and have no limit cycles. 

+ It is easy to get high-speed, pipelined designs. 

+ FIRs typically have low coefficient and arithmetic roundoff error budgets, 
and well-defined quantization noise. 

— Recursive FIR filters may be unstable because of imperfect pole/zero 
annihilation. 

— The sophisticated Parks-McClellan algorithms must be available for 
minimax filter design. 

— High filter length requires high implementation effort. 

Compared to an FIR filter, an HR filter can often be much more efficient 
in terms of attaining certain performance characteristics with a given filter 
order. This is because the HR filter incorporates feedback and is capable 
of realizing both zeros and poles of a system transfer function, whereas the 
FIR filter is an all-zero filter. In this chapter, the fundamentals of HR fil- 
ter design will be developed. The traditional approach to the design of HR 
filters involves the transformation of an analog filter, with defined feedback 
specifications, into the digital domain. This is a reasonable approach, mainly 
because the art of designing analog filters is highly advanced, and many stan- 
dard tables are available, i.e., [70]. We will review the four most important 
classes of these analog prototype filters in this chapter, namely Butterworth, 
Chebyshev I and II, and elliptic filters. 

The IIR will be shown to overcome many of the deficiencies of the FIR, 
but to have some less desirable properties as well. The general desired (+) 
and undesired (-) properties of an IIR filter are: 

+ Standard design using an analog prototype filter is well understood. 

+ Highly selective filters can be realized with low-order designs that can 
run at high speeds. 

+ Design using tables and a pocket calculator is possible. 

+ For the same tolerance scheme, filters are short, compared with FIR 
filters. 

+ Closed-loop design algorithms can be used. 

- Nonlinear-phase response is typical, i.e., it is difficult to get linear-phase 
response. (Using an allpass filter for phase compensation results in twice 
the complexity.) 

- Limit cycles may occur for integer implementation. 

- Multiband design is difficult; only low, high, or bandpass filters are 
designed. 

- Feedback can introduce instabilities. (Most often, the mirror pole to the 
unit circle can be used to produce the same magnitude response, and the 
filter will be stable.) 

- It is more difficult to get high-speed, pipelined designs. 

Fig. 4.1. First-order IIR filter used as lossy integrator. 

To demonstrate the possible benefits of using IIR filters, we will discuss 
a first-order IIR filter example. 

Example 4.1: Lossy Integrator I 

One of the basic tasks of a filter may be to smooth a noisy signal. Assume 
that a signal x[n] is received in the presence of wideband zero-mean random 
noise. Mathematically, an integrator could be used to suppress the effects of 
the noise. If the average value of the input signal is to be preserved over a 
finite time interval, a lossy integrator is often used to process the signal with 
additive noise. Figure 4.1 displays a simple first-order lossy integrator that 
satisfies the discrete-time difference equation: 

$$ y[n+1] = \frac{3}{4}\, y[n] + x[n]. \tag{4.1} $$

As we can see from the impulse response in Fig. 4.2a, the same functionality 
of the first-order lossy integrator can be achieved with a 15-tap FIR filter. 
The step response of the lossy integrator is shown in Fig. 4.2b. 

Fig. 4.2. Simulation of lossy integrator with a = 3/4. (a) Impulse response for 
1000 δ[n]. (b) Step response for 100 σ[n]. 
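The recursion of the lossy integrator with a = 3/4 can be tried out numerically before turning to hardware. The following is a minimal Python sketch, not part of the original design flow; it assumes integer data and models each division as truncation toward zero, with 3/4 split into 1/2 + 1/4. The helper names `tdiv` and `lossy_integrator` are hypothetical.

```python
def tdiv(a: int, b: int) -> int:
    """Integer division that truncates toward zero (assumed hardware-style '/')."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def lossy_integrator(x, n_out):
    """Integer model of y[n+1] = (3/4) y[n] + x[n], using 3/4 = 1/2 + 1/4."""
    y, out = 0, []
    for n in range(n_out):
        xn = x[n] if n < len(x) else 0
        # New output from the previous output and the current input sample.
        y = xn + tdiv(y, 4) + tdiv(y, 2)
        out.append(y)
    return out


# Response to an impulse of amplitude 1000.
impulse = lossy_integrator([1000], 22)
print(impulse)
```

Because every division discards the fractional part, the integer response decays slightly faster than the ideal sequence 1000 (3/4)^n and reaches zero after a finite number of samples.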



The following VHDL code (see footnote 1) shows a possible implementation of this IIR filter. 

PACKAGE n_bit_int IS               -- User-defined type
  SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;

LIBRARY work;
USE work.n_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY iir IS
  PORT (x_in  : IN  BITS15;        -- Input
        y_out : OUT BITS15;        -- Result
        clk   : IN  STD_LOGIC);
END iir;

ARCHITECTURE flex OF iir IS
  SIGNAL x, y : BITS15;
BEGIN

  PROCESS           -- Use FF for input and recursive part
  BEGIN
    WAIT UNTIL clk = '1';
    x <= x_in;
    y <= x + y / 4 + y / 2;
  END PROCESS;

  y_out <= y;       -- Connect y to output pins

END flex;

Fig. 4.3. Impulse response for MaxPlusII simulation of the lossy integrator. 

Footnote 1: The equivalent Verilog code iir.v for this example can be found in 
Appendix A on page 462. 

Registers have been implemented using a WAIT statement inside a PROCESS 
block, while the multiplication and addition are implemented using CSD code. 
The design uses 31 logic cells and runs at a speed of 42.91 MHz, if synthesized 
with an Optimize Speed=10 option. The response of the filter to an impulse 
with amplitude 1000 is shown in Fig. 4.3, and agrees with the simulated 
results presented in Fig. 4.2a. | 4.1 | 



An alternative design approach using a “standard logic vector” data type 
and LPM_ADD_SUB megafunctions is discussed in Exercise 4.6 (p. 173). This 
second approach will produce longer VHDL code but will have the benefit of 
direct control, at the bit level, over the sign extension and multiplier. 



4.1 IIR Theory 

A nonrecursive filter incorporates, as the name implies, no feedback. The 
impulse response of such a filter is finite, i.e., it is an FIR filter. A recursive 
filter, on the other hand, has feedback, and is expected, in general, to have 
an infinite impulse response, i.e., to be an IIR filter. Figure 4.4a shows filters 
with separate recursive and nonrecursive parts. A canonical filter is produced 
if these recursive and nonrecursive parts are merged together, as shown in 
Fig. 4.4b. The transfer function of the filter from Fig. 4.4 can be written as: 

$$ F(z) = \frac{\sum_{l=0}^{L-1} a[l]\, z^{-l}}{1 - \sum_{l=1}^{L-1} b[l]\, z^{-l}}. \tag{4.2} $$

The difference equation for such a system yields: 

$$ y[n] = \sum_{l=0}^{L-1} a[l]\, x[n-l] + \sum_{l=1}^{L-1} b[l]\, y[n-l]. \tag{4.3} $$

Fig. 4.4. Filter with feedback. 



Comparing this with the difference equation for the FIR filter (3.2) on p. 110, 
we find that the difference equation for recursive systems depends not only 
on the L previous values of the input sequence x[n], but also on the L − 1 
previous values of y[n]. 
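A recursive difference equation of this kind, with a nonrecursive sum over input samples and a recursive sum over previous outputs, can be written down directly in a few lines. The Python sketch below is illustrative only; the function name `iir_filter` and the convention that `b[0]` is an unused placeholder are assumptions made here so that the list index matches the delay.

```python
def iir_filter(a, b, x):
    """y[n] = sum_{l>=0} a[l] x[n-l] + sum_{l>=1} b[l] y[n-l]; b[0] is ignored."""
    y = []
    for n in range(len(x)):
        # Nonrecursive (FIR) part: weighted current and previous inputs.
        acc = sum(a[l] * x[n - l] for l in range(len(a)) if n - l >= 0)
        # Recursive part: weighted previous outputs only (l starts at 1).
        acc += sum(b[l] * y[n - l] for l in range(1, len(b)) if n - l >= 0)
        y.append(acc)
    return y


# First-order example: y[n] = 0.75 y[n-1] + x[n], driven by a unit impulse.
h = iir_filter(a=[1.0], b=[0.0, 0.75], x=[1.0, 0.0, 0.0, 0.0])
print(h)
```

With these coefficients the sketch reproduces the lossy integrator of Example 4.1, apart from a one-sample shift of the output index.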

If we compute poles and zeros of F(z), we see that the nonrecursive part, 
i.e., the numerator of F(z), produces the zeros $p_{0l}$, while the denominator of 
F(z) produces the poles $p_{\infty l}$. 

For the transfer function, the pole/zero plot can be used to look up the 
most important properties of the filter. If we substitute $z = e^{j\omega T}$ in the 
z-domain transfer function, we can construct the Fourier transfer function 

$$ F(\omega) = |F(\omega)|\, e^{j\theta(\omega)} = \frac{\prod_{l=0}^{L-2} \left(p_{0l} - e^{j\omega T}\right)}{\prod_{l=0}^{L-2} \left(p_{\infty l} - e^{j\omega T}\right)} \tag{4.4} $$

by graphical means. This is shown in Fig. 4.5, for a specific amplitude (i.e., 
gain) and phase value. The gain at a specific frequency $\omega_0$ is the quotient of 
the lengths of the zero vectors $v_l$ and the pole vectors $u_l$. These vectors start 
at a specific zero or pole, respectively, and end at the frequency point, 
$e^{j\omega_0 T}$, of interest. The phase gain for the example from Fig. 4.5 becomes 
$\theta(\omega_0) = \alpha_0 + \alpha_1 - \beta_0$. 

Fig. 4.5. Computation of transfer function using the pole/zero plot. Amplitude 
gain = $v_0 v_1/u_0$, phase gain = $\alpha_0 + \alpha_1 - \beta_0$. 
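The graphical construction can be mimicked numerically: draw a vector from every zero and every pole to the point on the unit circle, multiply the zero-vector lengths, divide by the pole-vector lengths, and add or subtract the vector angles. The Python sketch below assumes this convention (vector = frequency point minus root) and ignores any constant scale factor of the transfer function; `gain_phase` is a hypothetical helper name, and the returned phase is only meaningful modulo 2π.

```python
import cmath
import math


def gain_phase(zeros, poles, omega_T):
    """Gain and phase at z = exp(j*omega_T) from the zero and pole vectors."""
    point = cmath.exp(1j * omega_T)
    gain, phase = 1.0, 0.0
    for z0 in zeros:
        v = point - z0            # zero vector, from the zero to the frequency point
        gain *= abs(v)
        phase += cmath.phase(v)
    for p in poles:
        u = point - p             # pole vector, from the pole to the frequency point
        gain /= abs(u)
        phase -= cmath.phase(u)
    return gain, phase


# First-order example z / (z - 0.75): one zero at the origin, one pole at 0.75.
g0, ph0 = gain_phase(zeros=[0.0], poles=[0.75], omega_T=0.0)
g_pi, _ = gain_phase(zeros=[0.0], poles=[0.75], omega_T=math.pi)
print(g0, ph0, g_pi)
```

For this example the pole vector is shortest at ω = 0, which is why the gain is largest there and the filter behaves as a lowpass.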



Using the connection between the transfer function in the Fourier domain 
and the pole/zero plot, we can already deduce several properties: 

1) A zero on the unit circle $p_0 = e^{j\omega_0 T}$ (with no annihilating pole) produces 
a zero in the transfer function in the Fourier domain at the frequency $\omega_0$. 

2) A pole on the unit circle $p_\infty = e^{j\omega_0 T}$ (and no annihilating zero) produces 
an infinite gain in the transfer function in the Fourier domain at the 
frequency $\omega_0$. 

3) A stable filter with all poles inside the unit circle can have any type of 
input signal. 

4) A real filter has single poles and zeros on the real axis, while complex 
poles and zeros appear always in pairs, i.e., if $a_0 + ja_1$ is a pole or zero, 
$a_0 - ja_1$ must also be a pole or zero. 

5) A linear-phase (i.e., constant group delay) filter has all poles and zeros 
symmetric to the unit circle or at z = 0. 

If we combine observations 3 and 5, we find that, for a stable linear-phase 
system, all zeros must be symmetric to the unit circle and only poles at z = 0 
are permitted. 

An IIR filter (with poles $z \neq 0$) can therefore be only approximately 
linear-phase. To achieve this approximation a well-known principle from analog 
filter design is used: an allpass has a unit gain, and introduces a nonzero 
phase gain, which is used to achieve linearization in the frequency range of 
interest, i.e., the passband. 



4.2 IIR Coefficient Computation 



In classical IIR design, a digital filter is designed that approximates an ideal 
filter. The ideal digital filter model specifications are mathematically 
converted into a set of specifications for an analog filter model using the bilinear 
z-transform given by: 

$$ s = \frac{z - 1}{z + 1}. \tag{4.5} $$

A classic analog Butterworth, Chebyshev, or elliptic model can be synthesized 
from these specifications, and is then mapped into a digital IIR using 
this bilinear z-transform. 
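Two properties of the mapping s = (z − 1)/(z + 1) are worth checking numerically: points on the unit circle land on the imaginary axis of the s-plane (with the analog frequency equal to the tangent of half the digital angle), and points inside the unit circle land in the left half-plane. The Python sketch below is only an illustration; the function names are hypothetical, and the normalized form of the transform without a sampling-period factor is assumed.

```python
import cmath
import math


def bilinear_s(z):
    """Normalized bilinear transform s = (z - 1) / (z + 1)."""
    return (z - 1) / (z + 1)


def warped_frequency(omega_T):
    """Analog frequency reached from the digital angle omega*T: tan(omega*T / 2)."""
    return math.tan(omega_T / 2)


for omega_T in (0.1 * math.pi, 0.3 * math.pi, 0.5 * math.pi):
    s = bilinear_s(cmath.exp(1j * omega_T))
    # The real part is zero up to rounding; the imaginary part is the warped frequency.
    print(round(omega_T, 4), s, warped_frequency(omega_T))

# A point inside the unit circle maps to the left half of the s-plane.
s_inside = bilinear_s(0.5)
print(s_inside)
```

The tangent relation is the reason digital band edges have to be prewarped before the analog prototype is designed.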

An analog Butterworth filter has a magnitude-squared frequency response 
given by: 

$$ |F(\omega)|^2 = \frac{1}{1 + \left(\frac{\omega}{\omega_s}\right)^{2N}}. \tag{4.6} $$

The poles of $|F(\omega)|^2$ are distributed along a circular arc at locations 
separated by $\pi/N$ radians. More specifically, the transfer function is N times 
differentiable at $\omega = 0$. This results in a locally smooth transfer function around 
0 Hz. An example of a Butterworth filter model is shown in Fig. 4.6(upper). 
Note that the tolerance scheme for this design is the same as for the Kaiser 
window and equiripple design shown in Fig. 3.7 (p. 120). 
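How quickly the required Butterworth order grows with a tight tolerance scheme can be estimated with the standard textbook order formula. The Python sketch below assumes the usual Butterworth magnitude-squared form 1/(1 + (ω/ω_ref)^{2N}); the function names, the parameter names, and the example specification (3 dB at the passband edge, 40 dB at twice that frequency) are illustrative assumptions and are not taken from the tolerance scheme used in the figures.

```python
import math


def butter_mag2(omega, omega_ref, order):
    """Magnitude-squared response 1 / (1 + (omega/omega_ref)^(2N))."""
    return 1.0 / (1.0 + (omega / omega_ref) ** (2 * order))


def butter_order(omega_pass, omega_stop, pass_db, stop_db):
    """Smallest order meeting pass_db at omega_pass and stop_db at omega_stop."""
    ratio = (10 ** (stop_db / 10) - 1) / (10 ** (pass_db / 10) - 1)
    return math.ceil(math.log10(ratio) / (2 * math.log10(omega_stop / omega_pass)))


# Hypothetical specification: 3 dB passband edge at 1, 40 dB stopband edge at 2.
n_min = butter_order(omega_pass=1.0, omega_stop=2.0, pass_db=3.0, stop_db=40.0)
print(n_min, butter_mag2(1.0, 1.0, n_min))
```

Narrowing the transition band (a stopband edge closer to the passband edge) shrinks the logarithm in the denominator, which is why Butterworth designs need comparatively high orders.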

An analog Chebyshev filter of Type I or II is defined in terms of a Chebyshev 
polynomial $V_N(\omega) = \cos(N \cos^{-1}(\omega))$, which forces the filter poles to reside 
on an ellipse. The magnitude-squared frequency response of a Type I filter is 
represented by: 

$$ |F(\omega)|^2 = \frac{1}{1 + \varepsilon^2 V_N^2\!\left(\frac{\omega}{\omega_s}\right)}. \tag{4.7} $$

An example of a typical Type I magnitude frequency and impulse response 
is shown in Fig. 4.7(upper). Note the ripple in the passband, and smooth 
stopband behavior. 

The Type II magnitude-squared frequency response is modeled as: 

$$ |F(\omega)|^2 = \frac{1}{1 + \left(\varepsilon^2 V_N^2\!\left(\frac{\omega_s}{\omega}\right)\right)^{-1}}. \tag{4.8} $$





Fig. 4.6. Filter design with MatLab toolbox. (upper) Butterworth filter and 
(lower) elliptic filter. 
(a) Transfer function. (b) Group delay of passband. (c) Pole/zero plot (x = pole; 
o = zero). 



An example of a typical Type II magnitude frequency and impulse response 
is shown in Fig. 4.7(lower). Note that in this case a smooth passband 
results, and the stopband now exhibits ripple behavior. 

An analog elliptic prototype filter is defined in terms of the solution to 
the Jacobian elliptic function, $U_N(\omega)$. The magnitude-squared frequency 
response is modeled as: 

$$ |F(\omega)|^2 = \frac{1}{1 + \varepsilon^2 U_N^2\!\left(\frac{\omega}{\omega_s}\right)}. \tag{4.9} $$

The magnitude-squared and impulse response of a typical elliptic filter 
is shown in Fig. 4.6(lower). Observe that the elliptic filter exhibits ripple in 
both the passband and stopband. 

If we compare the four different IIR filter implementations, we find that 
a Butterworth filter has order 19, a Chebyshev has order 8, while the elliptic 
design has order 6, for the same tolerance scheme shown in Fig. 3.8 (p. 121). 





Fig. 4.7. Chebyshev filter design with MatLab toolbox. Chebyshev I (upper) 
and Chebyshev II (lower). 

(a) Transfer function, (b) Group delay of passband. (c) Pole/zero plot (x = pole; 
o = zero). 



If we compare Figs. 4.6 and 4.7, we find that for the filters with lower order 
the ripple increases, and the group delay becomes highly nonlinear. A good 
compromise is most often the Chebyshev Type II filter with medium order, 
a flat passband, and tolerable group delay. 

4.2.1 Summary of Important IIR Design Attributes 

In the previous section, classic IIR types were presented. Each model provides 
the designer with tradeoff choices. The attributes of classic IIR types are 
summarized as follows: 

• Butterworth: Maximally flat passband, flat stopband, wide transition 
band 

• Chebyshev I: Equiripple passband, flat stopband, moderate transition 
band 

• Chebyshev II: Flat passband, equiripple stopband, moderate transition 
band 

• Elliptic: Equiripple passband, equiripple stopband, narrow transition 
band 

Fig. 4.8. Direct I form IIR filter using multiplier blocks. 

For a given set of filter requirements, the following observations generally 
hold: 

• Filter order 

- Lowest: Elliptic 

- Medium: Chebyshev I or II 

- Highest: Butterworth 

• Passband characteristics 

- Equiripple: Elliptic, Chebyshev I 

- Flat: Butterworth, Chebyshev II 

• Stopband characteristics 

- Equiripple: Elliptic, Chebyshev II 

- Flat: Butterworth, Chebyshev I 

• Transition band characteristics 

- Narrowest: Elliptic 

- Medium: Chebyshev I or II 

- Widest: Butterworth 



4.3 IIR Filter Implementation 

Obtaining an IIR transfer function is generally considered to be a 
straightforward exercise, especially if design software like MatLab is used. IIR filters 
can be developed in the context of many architectures. The most important 
structures are summarized as follows: 

• Direct I form (see Fig. 4.8) 

• Direct II form (see Fig. 4.9) 

• Cascade of first- or second-order systems (see Fig. 4.10a) 

• Parallel implementation of first- or second-order systems 
(see Fig. 4.10b). 

• BiQuad implementation of a typical second-order section found in basic 
cascade or parallel designs (see Fig. 4.11) 

• Normal [71], i.e., cascade of first- or second-order state variable systems 
(see Fig. 4.10a) 

• Parallel normal, i.e., parallel first- or second-order state variable systems 
(see Fig. 4.10b) 

• Continued fraction structures 

• Lattice filter (after Gray-Markel, see Fig. 4.12) 

• Wave digital implementation (after Fettweis [72]) 

• General state space filter 

Each architecture serves a unique purpose. Some of the general selection 
rules are summarized below: 

• Speed 

- High: Direct I & II 

- Low: Wave 

• Fixed-point arithmetic roundoff error sensitivity 

- High: Direct I & II 

- Low: Normal, Lattice 

• Fixed-point coefficient roundoff error sensitivity 

- High: Direct I & II 

- Low: Parallel, Wave 

• Special properties 

- Orthogonal weight outputs: Lattice 

- Optimized second-order sections: Normal 

- Arbitrary IIR specification: State variable 

Fig. 4.10. (a) Serial implementation $F(z) = \prod_{k=1}^{N} F_k(z)$. (b) Parallel 
implementation $F(z) = \sum_{k=1}^{N} F_k(z)$. 

Fig. 4.11. Possible second-order section BiQuad with transfer function 
$F(z) = (b[0] + b[1]z^{-1} + b[2]z^{-2}) / (a[0] + a[1]z^{-1} + a[2]z^{-2})$. 

With the help of software tools like MatLab, the coefficients can easily be 
converted from one architecture to another, as demonstrated by the following 
example. 



Example 4.2: Butterworth Second-order System 

Assume we wish to design a Butterworth filter (order N = 10, passband Fp = 
0.3 Fs) realized by second-order systems. We can use the following MatLab 
code to generate the coefficients: 

N=10; Fp=0.3; 
[B, A] = butter(N, Fp) 
[sos, gain] = tf2sos(B, A) 

i.e., we first compute the Butterworth coefficients using the function butter(), 
and then convert these filter coefficients using the "transfer function to 
second-order section" function tf2sos to compute the BiQuad coefficients. 

We will get the following results using MatLab: 

B = 0.0001, 0.0012, 0.0048, 0.0112, 0.0168, 
    0.0168, 0.0112, 0.0048, 0.0012, 0.0001 
A = 1.0000, -3.5863, 6.5587, -7.5520, 5.9363, 
    -3.2606, 1.2421, -.03146, 0.0479, -0.0033 

and for the second-order sections we finally get: 

b[0,i]   b[1,i]   b[2,i]   a[0,i]   a[1,i]    a[2,i] 
1.0000   2.0019   0.9997   1.0000   -0.4709   0.0610 
1.0000   2.0762   1.0784   1.0000   -0.4936   0.1121 
1.0000   2.0274   1.0296   1.0000   -0.5434   0.2243 
1.0000   1.9695   0.9717   1.0000   -0.6310   0.4216 
1.0000   1.9250   0.9271   1.0000   -0.7786   0.7541 

Fig. 4.13. Tenth-order Butterworth filter showing (a) magnitude, (b) phase, and 
(c) group delay response. 

Figure 4.13 shows the transfer function, group delay, and the pole/zero plot 
of the filter. Note that all zeros are at $z_{0i} = -1$, which can also be seen 
from the numerator coefficients of the second-order systems. Note also the 
rounding error in b[1,i] = 2 and b[0,i] = b[2,i] = 1. | 4.2 | 
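Once a filter is available as a list of second-order sections, the cascade can be evaluated section by section. The Python sketch below is a plain illustration, not the MatLab toolbox implementation: it runs each BiQuad in direct form I and uses the four-decimal section coefficients printed in Example 4.2. Since the value of the overall gain factor is not listed in the example, the sketch simply chooses the gain that normalizes the DC response to one, which is an assumption. The names `biquad`, `sos_filter`, and `dc_gain` are hypothetical.

```python
def biquad(section, x):
    """One section (b0 + b1 z^-1 + b2 z^-2) / (a0 + a1 z^-1 + a2 z^-2), direct form I."""
    b0, b1, b2, a0, a1, a2 = section
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, xn          # shift the input delay line
        y2, y1 = y1, yn          # shift the output delay line
        out.append(yn)
    return out


def sos_filter(sos, gain, x):
    """Cascade (serial) connection: the output of one section feeds the next."""
    y = [gain * v for v in x]
    for section in sos:
        y = biquad(section, y)
    return y


def dc_gain(sos, gain):
    """Gain at z = 1, i.e., the product of the section gains at DC."""
    g = gain
    for b0, b1, b2, a0, a1, a2 in sos:
        g *= (b0 + b1 + b2) / (a0 + a1 + a2)
    return g


# Second-order sections as printed in Example 4.2 (rounded to four decimals).
sos = [
    (1.0, 2.0019, 0.9997, 1.0, -0.4709, 0.0610),
    (1.0, 2.0762, 1.0784, 1.0, -0.4936, 0.1121),
    (1.0, 2.0274, 1.0296, 1.0, -0.5434, 0.2243),
    (1.0, 1.9695, 0.9717, 1.0, -0.6310, 0.4216),
    (1.0, 1.9250, 0.9271, 1.0, -0.7786, 0.7541),
]
gain = 1.0 / dc_gain(sos, 1.0)   # assumed normalization: unit gain at DC
step = sos_filter(sos, gain, [1.0] * 200)
print(gain, step[-1])
```

A cascade keeps each recursion second-order, which is the main reason it is less sensitive to coefficient rounding than a single tenth-order direct form.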



4.3.1 Finite Wordlength Effects 

Crochiere and Oppenheim [73] have shown that the coefficient wordlength 
required for a digital filter is closely related to the coefficient sensitivities. 
Implementation of the same IIR filter can therefore lead to a wide range of 
required wordlengths. To illustrate some of the dynamics of this problem, 
consider an eighth-order elliptic filter analyzed by Crochiere and Oppenheim 
[73]. The resulting eighth-order transfer function was implemented with a 
Wave, Cascade, Parallel, Lattice, Direct I and II, and Continued Fraction 
architecture. The coefficient wordlength needed to meet a specific maximal 
passband error criterion was conservatively estimated as shown in the second 
column of Table 4.1. As a result, it can be seen that the Direct form needs 
more wordlength than the Wave or Parallel structure. This has led to the 
conclusion that a Wave structure gives the best complexity (MW) in terms 
of the product of bit-width (W) and number of multipliers (M), as can be 
seen from column six of Table 4.1. 

Table 4.1. Data for eighth-order elliptic filter by Crochiere and Oppenheim [73], 
sorted according to the cost M × W. 

Type         Wordlength W   Mults M   Adds   Delays   Cost M × W 
Wave             11.35         12      31      10        136 
Cascade          11.33         13      16       8        147 
Parallel         10.12         18      16       8        182 
Lattice          13.97         17      32       8        238 
Direct I         20.86         16      16      16        334 
Direct II        20.86         16      16       8        334 
Cont.-frac       22.61         18      16       8        408 

In the context of FIR filters (see Chap. 3), the reduced adder graph (RAG) 
technique was introduced in order to simplify the design of a block of several 
multipliers [74, 75]. Dempster and Macleod have evaluated the eighth-order 
elliptic filter from above, in the context of RAG multiplier implementation 
strategies. A comparison is presented in Table 4.2. The second column 
displays the multiplier block size. For a Direct II architecture, two multiplier 
blocks, of size 9 and 7, are required. For a Wave architecture, no two 
coefficients have the same input, and, as a result, no multiplier blocks can be 
developed. Instead, eleven individual multipliers must be implemented. The 
third column displays the number of adders/subtractors B for a canonical 
signed digit (CSD) design required to implement the multiplier blocks. 
Column four shows the same result for single-optimized multiplier adder graphs 
(MAG) [76]. Column five shows the result for the reduced adder graph. 
Column six shows the overall adder/wordwidth product for a RAG design. Table 
4.2 shows that Cascade and Parallel forms give comparable or better results, 
compared with Wave digital filters, because the multiplier block size is an 
essential criterion when using the RAG algorithms. Delays have not been 
considered for the FPGA design, because all the logic cells have an 
associated flip-flop. 
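The adder counts quoted for a CSD design follow from the number of nonzero digits in the canonical signed digit code of each coefficient: a constant multiplier with k nonzero digits needs k − 1 adders or subtractors. The Python sketch below shows one common way to compute the CSD digits of a positive integer; the function names are hypothetical, and the example coefficients 7 and 93 are chosen for illustration and are not taken from the elliptic filter in the tables.

```python
def csd_digits(value: int):
    """CSD (non-adjacent form) digits of a non-negative integer, LSB first."""
    digits = []
    while value:
        if value & 1:
            # Choose +1 if value mod 4 == 1 and -1 if value mod 4 == 3,
            # so that the next digit is guaranteed to be zero.
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def csd_adders(value: int) -> int:
    """Adders/subtractors for one constant multiplier: nonzero digits minus one."""
    nonzero = sum(1 for d in csd_digits(value) if d != 0)
    return max(nonzero - 1, 0)


for coeff in (7, 93):
    digits = csd_digits(coeff)
    # Rebuild the coefficient from its signed digits as a consistency check.
    rebuilt = sum(d * (1 << i) for i, d in enumerate(digits))
    print(coeff, digits, rebuilt, csd_adders(coeff))
```

MAG and RAG designs can need fewer adders than this per-coefficient CSD count because they share intermediate results, which the sketch does not attempt to model.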

4.3.2 Optimization of the Filter Gain Factor 

Table 4.2. Data for eighth-order elliptic filter implemented using CSD, MAG, and 
RAG strategies [74]. 

Type         Block size              CSD B   MAG B   RAG B   RAG W(B + A) 
Cascade      4 × 3, 2 × 1              26      26      24        453 
Parallel     1 × 9, 4 × 2, 1 × 1       31      30      29        455 
Wave         11 × 1                    58      63      22        602 
Lattice      1 × 9, 8 × 1              33      31      29        852 
Direct I     1 × 16                   103      83      36       1085 
Direct II    1 × 9, 1 × 7             103      83      41       1189 
Cont.-frac   18 × 1                   118     117      88       2351 

In general, we derive the IIR integer coefficients from floating-point filter 
coefficients by first normalizing to the maximum coefficient, and then multiplying 
with the desired gain factor, i.e., bit-width $2^{\mathrm{round}(W)}$. However, most often it 
is more efficient to select the gain factor within a range, 
$2^{\lfloor W \rfloor - 1} \ldots 2^{\lceil W \rceil - 1}$. There 
will be essentially no change in the transfer function, because the coefficients 
must be rounded anyway, after multiplying by the gain factor. If we apply, 
for instance, this search in the range $2^{\lfloor W \rfloor - 1} \ldots 2^{\lceil W \rceil - 1}$ for the cascade filter in 
the Crochiere and Oppenheim design example from above (gain used in Table 
4.1 was $2^{\lfloor 11.33 \rfloor - 1} = 1024$), we get the data reported in Table 4.3. 

Table 4.3. Variation of the gain factor to minimize filter complexity of the cascade 
filter. 

                             CSD     MAG     RAG 
Optimal gain                 1122    1121    1121 
# adders for optimal gain      23      21      18 
# adders for gain = 1024       26      26      24 
Improvement                   12%     19%     25% 

We note, from the comparison shown in Table 4.3, a substantial improvement 
in the number of adders required to implement the multiplier. Although 
the optimal gain factor for MAG and RAG in this case is the same, it can be 
different. 
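The gain-factor search can be sketched as a brute-force loop: scale the normalized coefficients by each candidate gain, round, count the adders, and keep the cheapest gain. The Python sketch below uses a simple per-coefficient CSD adder count as the cost (it does not model MAG or RAG sharing), and the coefficient list is a made-up example rather than the Crochiere and Oppenheim filter. All names are hypothetical.

```python
def nonzero_csd_digits(value: int) -> int:
    """Number of nonzero digits in the CSD code of |value| (bit-trick form)."""
    value = abs(value)
    half = value >> 1
    # The set bits of (value + half) XOR half mark the nonzero CSD positions.
    return bin((value + half) ^ half).count("1")


def csd_cost(coeffs, gain):
    """Total adders when every rounded coefficient is built as its own CSD multiplier."""
    total = 0
    for c in coeffs:
        q = round(c * gain)
        total += max(nonzero_csd_digits(q) - 1, 0)
    return total


def best_gain(coeffs, low, high):
    """Gain in [low, high] with the fewest adders; ties go to the smaller gain."""
    return min(range(low, high + 1), key=lambda g: (csd_cost(coeffs, g), g))


# Hypothetical normalized coefficients (largest magnitude equal to one).
coeffs = [1.0, -0.4709, 0.0610, -0.7786, 0.7541]
g_opt = best_gain(coeffs, 1024, 2048)
print(g_opt, csd_cost(coeffs, g_opt), csd_cost(coeffs, 1024))
```

Because the search space is only a factor of two wide, an exhaustive scan is cheap, and the chosen gain is never worse than the power-of-two default that lies at the lower end of the range.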



4.4 Fast IIR Filter 

In Chap. 3, FIR filter Registered Performance was improved using pipelin- 
ing (see Fig. 3.6, p. 118). In the case of FIR filters, pipelining can be achieved 
at essentially no cost. Pipelining IIR filters, however, is more sophisticated 
and is certainly not free. Strategies reported to improve IIR filter throughput 
are: 

• Look-ahead interleaving in the time domain [77] 

• Clustered look-ahead pole/zero assignment [78, 79] 

• Scattered look-ahead pole/zero assignment [77, 80] 

• IIR decimation filter design [81] 

• Parallel processing [82] 

• RNS implementation [35, Sect. 4.2] [45] 

The first five methods are based on filter architecture or signal flow 
techniques, and the last is based on computer arithmetic (see Chap. 2). These 
techniques will be demonstrated with examples. To simplify the VHDL 
representation of each case, only a first-order IIR filter will be considered, but 
the same ideas can be applied to higher-order IIR filters and can be found in 
the literature references. 

Fig. 4.14. Lossy integrator with look-ahead arithmetic. 



4.4.1 Time-domain Interleaving 

Consider the difference equation of a first-order IIR system, namely 

$$ y[n+1] = a\,y[n] + b\,x[n]. \tag{4.10} $$

The output of the first-order system, namely y[n + 1], can be computed 
using a look-ahead methodology by substituting y[n + 1] into the difference 
equation for y[n + 2]. That is 

$$ y[n+2] = a\,y[n+1] + b\,x[n+1] = a^2 y[n] + ab\,x[n] + b\,x[n+1]. \tag{4.11} $$

The equivalent system is shown in Fig. 4.14. 

This concept can be generalized by applying the look-ahead transform for 
(S − 1) steps, resulting in: 

$$ y[n+S] = a^S y[n] + \underbrace{\sum_{k=0}^{S-1} a^k b\, x[n+S-1-k]}_{(\eta)}. \tag{4.12} $$

It can be seen that the term $(\eta)$ defines an FIR filter having coefficients 
$\{b, ab, a^2 b, \ldots, a^{S-1} b\}$, that can be pipelined using the pipelining techniques 
presented in Chap. 3 (i.e., pipelined multiplier and pipelined adder trees). 
The recursive part of (4.12) can now also be implemented with an S-stage 
pipelined multiplier for the coefficient $a^S$. We will demonstrate the look-ahead 
design with the following example. 
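As a quick numerical check of the look-ahead idea before the example, the S-step form can be compared with the original one-step recursion. The Python sketch below uses exact rational arithmetic so that the two results can be compared for equality; it assumes zero initial conditions (all inputs and outputs are zero before n = 0), and the function names are hypothetical.

```python
from fractions import Fraction


def sample(x, n):
    """x[n] with zeros outside the stored range."""
    return x[n] if 0 <= n < len(x) else Fraction(0)


def direct(a, b, x, n_out):
    """One-step recursion y[n+1] = a*y[n] + b*x[n] with y[0] = 0."""
    y = [Fraction(0)]
    for n in range(n_out - 1):
        y.append(a * y[n] + b * sample(x, n))
    return y


def look_ahead(a, b, x, n_out, S):
    """S-step form: y[m] = a^S y[m-S] + sum_k a^k b x[m-1-k], k = 0..S-1."""
    fir = [a ** k * b for k in range(S)]          # nonrecursive (FIR) coefficients
    y = [Fraction(0)] * n_out
    for m in range(1, n_out):
        recursive = a ** S * (y[m - S] if m - S >= 0 else 0)
        y[m] = recursive + sum(fir[k] * sample(x, m - 1 - k) for k in range(S))
    return y


a, b = Fraction(3, 4), Fraction(1)
x = [Fraction(v) for v in (1000, 0, 0, 250, -40, 0, 7)]
reference = direct(a, b, x, 12)
print(look_ahead(a, b, x, 12, S=2) == reference,
      look_ahead(a, b, x, 12, S=4) == reference)
```

The recursive loop of the S-step form only touches y[m − S], so S register stages are available inside the feedback path, at the price of S FIR coefficients in the nonrecursive part.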

Example 4.3: Lossy Integrator II 

Consider again the lossy integrator from Example 4.1 (p. 148), but now with 
look-ahead. Figure 4.14 shows the look-ahead lossy integrator, which is a 
combination of a nonrecursive part (i.e., FIR filter for x), and a recursive 
part with delay 2 and coefficient 9/16. 

$$ y[n+2] = \frac{3}{4} y[n+1] + x[n+1] = \frac{3}{4}\left(\frac{3}{4} y[n] + x[n]\right) + x[n+1] = \frac{9}{16} y[n] + \frac{3}{4} x[n] + x[n+1]. \tag{4.13} $$

The VHDL code (see footnote 2) shown below implements the IIR filter in look-ahead form. 

PACKAGE n_bit_int IS               -- User-defined type
  SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;

LIBRARY work;
USE work.n_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY iir_pipe IS
  PORT (x_in  : IN  BITS15;        -- Input
        y_out : OUT BITS15;        -- Result
        clk   : IN  STD_LOGIC);
END iir_pipe;

ARCHITECTURE flex OF iir_pipe IS
  SIGNAL x, x3, sx, y, y9 : BITS15;
BEGIN

  PROCESS     -- Use FFs for input, output and pipeline stages
  BEGIN
    WAIT UNTIL clk = '1';
    x  <= x_in;
    x3 <= x / 2 + x / 4;     -- Compute x*3/4
    sx <= x + x3;            -- Sum of x element i.e. output FIR part
    y9 <= y / 2 + y / 16;    -- Compute y*9/16
    y  <= sx + y9;           -- Compute output
  END PROCESS;

  y_out <= y;                -- Connect register y to output pins

END flex;

The pipelined adder and multiplier in this example are implemented in two 
steps. In the first stage, (9/16) y[n] is computed. In the second stage, 
(3/4) x[n] + x[n + 1] and (9/16) y[n] are added. The design consumes 64 logic 
cells and runs with a Registered Performance of 49.75 MHz. The response of the 
filter to an impulse with amplitude 1000 is shown in Fig. 4.15. | 4.3 | 

Fig. 4.15. VHDL simulation of impulse response of the look-ahead lossy integrator. 

Footnote 2: The equivalent Verilog code iir_pipe.v for this example can be found 
in Appendix A on page 463. 



Comparing the look-ahead scheme with the 31 LCs and 42.91 MHz solution 
reported in Example 4.1 (p. 148), we find that look-ahead pipelining has 
doubled the complexity of the design, with an attained speed-up of 16%. The 
comparison of the two filters' responses to the impulse with amplitude 1000 
shown in Fig. 4.2 (p. 149) and Fig. 4.15 reveals that the look-ahead scheme 
has an additional overall delay, and that the quantization effect differs by a 
±2 amount between the two methodologies. 
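The extra delay and the small quantization difference can be reproduced with a register-level model of the two designs. The Python sketch below is an illustration under stated assumptions: integer division truncates toward zero, all registers start at zero, and every register is updated simultaneously from the values held before the clock edge. The function names are hypothetical.

```python
def tdiv(a: int, b: int) -> int:
    """Integer division truncating toward zero (assumed hardware behaviour)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def simulate_iir(x_in, clocks):
    """Registers x and y of the direct lossy integrator, one list entry per clock."""
    x = y = 0
    out = []
    for n in range(clocks):
        xin = x_in[n] if n < len(x_in) else 0
        # Tuple assignment evaluates the right side with the old register values.
        x, y = xin, x + tdiv(y, 4) + tdiv(y, 2)
        out.append(y)
    return out


def simulate_iir_pipe(x_in, clocks):
    """Registers x, x3, sx, y9, y of the look-ahead lossy integrator."""
    x = x3 = sx = y9 = y = 0
    out = []
    for n in range(clocks):
        xin = x_in[n] if n < len(x_in) else 0
        x, x3, sx, y9, y = (xin,
                            tdiv(x, 2) + tdiv(x, 4),    # x * 3/4
                            x + x3,                     # FIR part
                            tdiv(y, 2) + tdiv(y, 16),   # y * 9/16
                            sx + y9)                    # output
        out.append(y)
    return out


direct = simulate_iir([1000], 30)
pipelined = simulate_iir_pipe([1000], 30)
# Align both responses on the first nonzero output before comparing them.
d0, p0 = direct.index(1000), pipelined.index(1000)
diff = [p - d for d, p in zip(direct[d0:], pipelined[p0:])]
print(d0, p0, max(abs(v) for v in diff))
```

The look-ahead version rounds the product with 9/16 once per two samples instead of rounding the product with 3/4 every sample, so the two integer responses drift apart slightly even though the ideal transfer functions are identical.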

An alternative design approach, using a standard logic vector data type 
and LPM_ADD_SUB megafunctions, is discussed in Exercise 4.7 (p. 173). The 
second approach will produce longer VHDL code, but will have the benefit 
of direct control at the bit level of the sign extension and multiplier. 



4.4.2 Clustered and Scattered Look-Ahead Pipelining 



Clustered and scattered look-ahead pipelining schemes add self-canceling 
poles and zeros to the design to facilitate pipelining of the recursive portion 
of the filter. In the clustered method, additional pole/zeros are introduced in 
such a way that in the denominator of the transfer function the coefficients for 
$z^{-1}, z^{-2}, \ldots, z^{-(S-1)}$ become zero. The following example shows clustering 
for a second-order filter. 



Example 4.4: Clustering Method 

A second-order transfer function is assumed to have a pole at 1/2 and 3/4 
and a transfer function given by: 

$$ F(z) = \frac{1}{1 - 1.25 z^{-1} + 0.375 z^{-2}} = \frac{1}{(1 - 0.5 z^{-1})(1 - 0.75 z^{-1})}. \tag{4.14} $$

Adding a canceling pole/zero at z = −1.25 results in a new transfer function 

$$ F(z) = \frac{1 + 1.25 z^{-1}}{1 - 1.1875 z^{-2} + 0.4688 z^{-3}}. \tag{4.15} $$

The recursive part of the filter can now be implemented with an additional 
pipeline stage. | 4.4 | 



Fig. 4.16. Pole/zero plot for scattered look-ahead first-order IIR filter. 
(a) $F_1(z) = 1 + a z^{-1}$. (b) $F_2(z) = 1 + a^2 z^{-2}$. (c) $F_3(z) = 1/(1 - a^4 z^{-4})$. 
(d) $F(z) = \prod_k F_k(z) = (1 + a z^{-1})(1 + a^2 z^{-2})/(1 - a^4 z^{-4}) = 1/(1 - a z^{-1})$. 

The problem with clustering is that the cancelled pole/zero pair may lie 
outside the unit circle, as is the case in the previous example (i.e., z = −1.25). 
This introduces instability into the design if the pole/zero annihilation 
is not perfect. In general, a second-order system with poles at $r_1, r_2$ 
and with one extra canceling pair has the extra pole located at $-(r_1 + r_2)$, which 
lies outside the unit circle for $|r_1 + r_2| > 1$. Soderstrand et al. [79] have 
described a stable clustering method, which in general introduces more than 
one canceling pole/zero pair. 
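The clustering step for a second-order section amounts to one polynomial multiplication: the denominator is multiplied by a first-order factor whose coefficient cancels the z^{-1} term. The Python sketch below uses exact fractions and the poles 1/2 and 3/4 of Example 4.4; the helper names are hypothetical, and polynomials are stored as coefficient lists in ascending powers of z^{-1}.

```python
from fractions import Fraction as Fr


def polymul(p, q):
    """Product of two polynomials in z^-1 given as ascending coefficient lists."""
    out = [Fr(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def cluster_one_stage(r1, r2):
    """Return the canceling factor and the new denominator for poles r1, r2."""
    den = polymul([1, -r1], [1, -r2])        # 1 - (r1 + r2) z^-1 + r1 r2 z^-2
    extra = [Fr(1), r1 + r2]                 # 1 + (r1 + r2) z^-1 removes the z^-1 term
    return extra, polymul(den, extra)


extra, new_den = cluster_one_stage(Fr(1, 2), Fr(3, 4))
print([float(c) for c in extra], [float(c) for c in new_den])
# The added pole/zero sits at z = -(r1 + r2); it is outside the unit circle here.
print(abs(extra[1]) > 1)
```

The vanishing z^{-1} coefficient is what buys the extra pipeline stage, while the magnitude of the added root shows directly whether the cancellation is numerically safe.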

The scattered look-ahead approach does not introduce stability problems. 
It introduces (S − 1) canceling pole/zero pairs located at $z_k = p e^{j2\pi k/S}$, for 
an original filter with a pole located at p. The denominator of the transfer 
function has, as a result, nonzero coefficients associated only with the terms 
$z^0, z^{-S}, z^{-2S}$, etc. 



Example 4.5: Scattered Look-Ahead Method 



Consider implementing a second-order system having poles located at $z_{\infty 1} = 0.5$ 
and $z_{\infty 2} = 0.75$ with two additional pipeline stages. A second-order 
transfer function of a filter with poles at 1/2 and 3/4 has the transfer function 

$$ F(z) = \frac{1}{1 - 1.25 z^{-1} + 0.375 z^{-2}} = \frac{1}{(1 - 0.5 z^{-1})(1 - 0.75 z^{-1})}. \tag{4.16} $$

The scattered look-ahead introduces two additional pipeline stages by adding 
pole/zero pairs at $0.5 e^{\pm j2\pi/3}$ and $0.75 e^{\pm j2\pi/3}$. Adding the canceling 
poles/zeros at these locations results in 

$$ \begin{aligned}
F(z) &= \frac{1}{1 - 1.25 z^{-1} + 0.375 z^{-2}} \times \frac{(1 + 0.5 z^{-1} + 0.25 z^{-2})(1 + 0.75 z^{-1} + 0.5625 z^{-2})}{(1 + 0.5 z^{-1} + 0.25 z^{-2})(1 + 0.75 z^{-1} + 0.5625 z^{-2})} \\
&= \frac{1 + 1.25 z^{-1} + 1.1875 z^{-2} + 0.4687 z^{-3} + 0.1406 z^{-4}}{1 - 0.5469 z^{-3} + 0.0527 z^{-6}} \\
&= \frac{512 + 640 z^{-1} + 608 z^{-2} + 240 z^{-3} + 72 z^{-4}}{512 - 280 z^{-3} + 27 z^{-6}},
\end{aligned} $$

and the recursive part can be implemented with two additional pipeline 
stages. | 4.5 | 
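The coefficients of the scattered look-ahead example can be re-derived with exact arithmetic. For each pole p and S pipeline stages, the canceling zeros at p e^{j2πk/S} combine into the real factor 1 + p z^{-1} + ... + p^{S−1} z^{-(S−1)}, which turns 1 − p z^{-1} into 1 − p^S z^{-S}. The Python sketch below applies this for S = 3 with poles at 1/2 and 3/4, the pole set that reproduces the numerical coefficients printed in the example (this choice of poles is an assumption about the intended example). Helper names are hypothetical.

```python
from fractions import Fraction as Fr


def polymul(p, q):
    """Product of two polynomials in z^-1 given as ascending coefficient lists."""
    out = [Fr(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def scattered_factor(p, S):
    """sum_{k=0}^{S-1} p^k z^-k, i.e., (1 - p^S z^-S) / (1 - p z^-1)."""
    return [p ** k for k in range(S)]


S = 3
poles = [Fr(1, 2), Fr(3, 4)]
num, den = [Fr(1)], [Fr(1)]
for p in poles:
    factor = scattered_factor(p, S)
    num = polymul(num, factor)                        # new nonrecursive part
    den = polymul(den, polymul([1, -p], factor))      # only z^0, z^-3, z^-6 survive

print([float(c) for c in num])
print([float(c) for c in den])
print([int(512 * c) for c in num], [int(512 * c) for c in den])
```

All added roots have the same magnitude as the original poles, so, unlike the clustered method, no root can end up outside the unit circle.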



It is interesting to note that for a first-order IIR system, clustered and 
scattered look-ahead methods result in the same pole/zero canceling pair 
lying on a circle around the origin with angle differences $2\pi/S$. The 
nonrecursive part can be realized with a "power-of-two decomposition" according 
to 

$$ (1 + a z^{-1})(1 + a^2 z^{-2})(1 + a^4 z^{-4}) \cdots. \tag{4.17} $$

Figure 4.16 shows such a pole/zero representation for a first-order section, 
which enables an implementation with four pipeline stages in the recursive 
part. 
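The power-of-two decomposition follows from applying the identity (1 − u)(1 + u) = 1 − u² repeatedly. A short derivation for the four-stage case, written out as a sketch with the same symbol a for the pole:

```latex
\begin{aligned}
(1 - a z^{-1})(1 + a z^{-1}) &= 1 - a^2 z^{-2}, \\
(1 - a^2 z^{-2})(1 + a^2 z^{-2}) &= 1 - a^4 z^{-4}, \\
\Rightarrow\quad \frac{1}{1 - a z^{-1}}
  &= \frac{(1 + a z^{-1})(1 + a^2 z^{-2})}{1 - a^4 z^{-4}}, \\
\text{and in general}\quad
\prod_{i=0}^{m-1} \left(1 + a^{2^i} z^{-2^i}\right)
  &= \sum_{k=0}^{2^m - 1} a^k z^{-k}
   = \frac{1 - a^{2^m} z^{-2^m}}{1 - a z^{-1}}.
\end{aligned}
```

Each additional factor doubles the number of pipeline stages available in the recursive part while adding only one more multiplier to the nonrecursive part.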



Fig. 4.17. (a) Transfer function, and (b) pole/zero distribution of a 37th-order 
Martinez-Parks IIR filter with S = 5. 

4.4.3 IIR Decimator Design 

Martinez and Parks [81] have introduced, in the context of decimation filters 
(see Chap. 5, p. 184), a filter design algorithm based on the minimax method. 
The resulting transfer function satisfies 

$$ F(z) = \frac{\sum_{l=0}^{L} b[l]\, z^{-l}}{1 - \sum_{n=1}^{N/S} a[n]\, z^{-nS}}. \tag{4.18} $$

That is, only every S-th coefficient in the denominator is nonzero. In 
this case, the recursive part (i.e., the denominator) can be pipelined with 
S stages. It has been found that in the resulting pole/zero distribution, all 
zeros are on the unit circle, as is usual for an elliptic filter, while the poles 
lie on circles, whose main axes have a difference in angle of $2\pi/S$, as shown 
in Fig. 4.17b. 



4.4.4 Parallel Processing 

In a parallel-processing filter implementation [82], P parallel IIR paths are 
formed, each running at 1/P of the input sampling rate. They are combined at the 
output using a multiplexer, as shown in Fig. 4.18. Because a multiplexer, in 
general, will be faster than a multiplier and/or adder, the parallel approach 
will be faster. Furthermore, each of the P paths has a factor of P more time to 
compute its assigned output. 

Fig. 4.18. Parallel IIR implementation. The tapped delay lines (TDL) run with a 
1/P input sampling rate. 

Fig. 4.19. Two-path parallel IIR filter implementation. 

To illustrate, consider again a first-order system and P = 2. The look-ahead 
scheme, as in (4.11) 

$$ y[n+2] = a\,y[n+1] + x[n+1] = a^2 y[n] + a\,x[n] + x[n+1] \tag{4.19} $$

is now split into even n = 2k and odd n = 2k − 1 output sequences, obtaining 

$$ y[n+2] = \begin{cases} y[2k+2] = a^2 y[2k] + a\,x[2k] + x[2k+1] \\ y[2k+1] = a^2 y[2k-1] + a\,x[2k-1] + x[2k], \end{cases} \tag{4.20} $$

where $n, k \in \mathbb{Z}$. The two equations are the basis for the following parallel IIR 
filter FPGA implementation. 



Example 4.6: Lossy Integrator III 

Consider implementing a parallel lossy integrator, with a =3/4, as an ex- 
tension to the methods presented in Examples 4.1 (p. 148) and 4.3 (p. 163). 
A two-channel parallel lossy integrator, which is a combination of two non- 
recursive parts (i.e., an FIR filter for x), and two recursive parts with delay 
2 and coefficient 9/16, is shown in Fig. 4.19. The VHDL code 3 shown below 
implements the design. 

PACKAGE n_bit_int IS               -- User defined type
  SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;

LIBRARY work;
USE work.n_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY iir_par IS                  ------> Interface
  PORT ( clk   : IN  STD_LOGIC;
         x_in  : IN  BITS15;
         clk2  : OUT STD_LOGIC;
         y_out : OUT BITS15);
END iir_par;

ARCHITECTURE flex OF iir_par IS

  TYPE STATE_TYPE IS (even, odd);
  SIGNAL state                         : STATE_TYPE;
  -- x_wait and y are used by the Multiplex process below
  SIGNAL x_even, x_odd, xd_odd, x_wait : BITS15;
  SIGNAL y_even, y_odd, y_wait, y      : BITS15;
  SIGNAL x_e, x_o, y_e, y_o            : BITS15;
  SIGNAL sum_x_even, sum_x_odd         : BITS15;
  SIGNAL clk_div2                      : STD_LOGIC;

BEGIN

  Multiplex: PROCESS  ------> Split x into even and odd samples;
  BEGIN               ------> recombine y at clk rate
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN even =>
        x_even   <= x_in;
        x_odd    <= x_wait;
        clk_div2 <= '1';
        y        <= y_wait;
        state    <= odd;
      WHEN odd =>
        x_wait   <= x_in;
        y        <= y_odd;
        y_wait   <= y_even;
        clk_div2 <= '0';
        state    <= even;
    END CASE;
  END PROCESS Multiplex;

  y_out <= y;
  clk2  <= clk_div2;

  Arithmetic: PROCESS  -- Runs at clk/2 and implements (4.20)
  BEGIN
    WAIT UNTIL clk_div2 = '0';
    -- Nonrecursive part: 3/4 * x_even + x_odd
    sum_x_even <= (x_even * 2 + x_even) / 4 + x_odd;
    -- Recursive part: 9/16 * y_even + sum_x_even
    y_even     <= (y_even * 8 + y_even) / 16 + sum_x_even;
    xd_odd     <= x_odd;
    sum_x_odd  <= (xd_odd * 2 + xd_odd) / 4 + x_even;
    y_odd      <= (y_odd * 8 + y_odd) / 16 + sum_x_odd;
  END PROCESS Arithmetic;

END flex;

³ The equivalent Verilog code iir_par.v for this example can be found in
Appendix A on page 464.



The design is realized with two PROCESS statements. In the first, PROCESS
Multiplex, x is split into even- and odd-indexed parts, and the output y is
recombined at the clk rate. In addition, the first PROCESS statement generates
the second clock, running at clk/2. The second block implements the filter's
arithmetic according to (4.20). When measuring the Registered Performance of
the design with MaxPlusII software, a problem arises: MaxPlusII cannot
compute the Registered Performance in multiple clock domains. Assuming
that the input multiplexer runs at twice the arithmetic speed, an
estimated 31.34 MHz yields an input rate of more than 60 MHz. This is the
result of the y_odd and y_even filters running off the clk_div2 clock. This
can be validated by the simulation shown in Fig. 4.20. The clk frequency
was chosen to be higher than the computed Registered Performance. | 4.6 |

Fig. 4.20. VHDL simulation of the response of the parallel IIR filter to an impulse
1000.



The disadvantage of the parallel implementation, compared with the other 
methods presented, is the relatively high implementation cost of 215 LCs. 

4.4.5 IIR Design Using RNS

Because the residue number system (RNS) uses an intrinsically short
wordlength, it is an excellent candidate for implementing fast (recursive) IIR
filters. In a typical IIR-RNS design, a system is implemented as a collection of
recursive and nonrecursive systems, each defined in terms of an FIR structure
(see Fig. 4.21). Each FIR may be implemented in RNS-DA, using a quarter-square
multiplier, or in the index domain, as developed in Chap. 2 (p. 43).

For a stable filter, the recursive part should be scaled to control dynamic
range growth. The scaling operation may be implemented with mixed radix
conversion, the Chinese remainder theorem (CRT), or the ε-CRT method. For
high-speed designs, it is preferable to add an additional pipeline delay based
on the clustered or scattered look-ahead pipelining technique [35, Sect. 4-2].
An RNS recursive filter design will be developed in detail in Sect. 5.3. It
will be seen that RNS design will improve speed from 50 MHz to more than
70 MHz.

Fig. 4.21. RNS implementation of IIR filters using two FIR sections and scaling.



Exercises 

4.1: A filter has the following specification: sampling frequency 2 kHz; passband
0-0.4 kHz, stopband 0.5-1 kHz; passband ripple, 3 dB, and stopband ripple, 48 dB.
Use the MatLab software and the “Interactive Lowpass Filter Design” demo from
the Signal Processing Toolbox for the filter design.

(a1) Design a Butterworth filter (called BUTTER).

(a2) Determine filter length and the absolute ripple in the passband.

(b1) Design a Chebyshev type I filter (called CHEBY1).

(b2) Determine filter length and the absolute ripple in the passband.

(c1) Design a Chebyshev type II filter (called CHEBY2).

(c2) Determine filter length and the absolute ripple in the passband.

(d1) Design an elliptic filter (called ELLIP).

(d2) Determine filter length and the absolute ripple in the passband.



4.2: (a) Compute the maximum bit growth for a first-order IIR filter with a pole
at z_∞ = 3/4.

(a2) Use the MatLab or C software to verify the bit growth using a step response
of the first-order IIR filter with a pole at z_∞ = 3/4.

(b) Compute the maximum bit growth for a first-order IIR filter with a pole at
z_∞ = 3/8.

(b2) Use the MatLab or C software to verify the bit growth using a step response
of the first-order IIR filter with a pole at z_∞ = 3/8.

(c) Compute the maximum bit growth for a first-order IIR filter with a pole at
z_∞ = p.



Exercises Using MaxPlusII 

4.3: (a) Implement a first-order IIR filter with a pole at z_∞0 = 3/8 and 12-bit
input width, using MaxPlusII.

(b) Determine the number of LCs and the Registered Performance.

(c) Simulate the design with an input impulse of 100.

(d) Compute the maximum bit growth for the filter.

(e) Verify the result from (d) with a simulation of the step response with amplitude
100.



4.4: (a) Implement a first-order IIR filter with a pole at z_∞0 = 3/8, 12-bit input
width, and a look-ahead of one step, using MaxPlusII.

(b) Determine the number of LCs and the Registered Performance.

(c) Simulate the design with an input impulse of 100.

4.5: (a) Implement a first-order IIR filter with a pole at z_∞0 = 3/8, 12-bit input
width, and a parallel design with two paths, using MaxPlusII.

(b) Determine the number of LCs and the Registered Performance.

(c) Simulate the design with an input impulse of 100.

4.6: (a) Implement a first-order IIR filter as in Example 4.1 (p. 148), using a 15-bit
std_logic_vector, and implement the adder with two lpm_add_sub megafunctions,
using MaxPlusII.

(b) Determine the number of LCs and the Registered Performance.

(c) Simulate the design with an input impulse of 1000, and compare the results to
Fig. 4.3 (p. 150).



4.7: (a) Implement a first-order pipelined IIR filter from Example 4.3 (p. 163)
using a 15-bit std_logic_vector, and implement the adder with four lpm_add_sub
megafunctions, using MaxPlusII.

(b) Determine the number of LCs and the Registered Performance. 

(c) Simulate the design with an input impulse of 1000, and compare the results to 
Fig. 4.15 (p. 165). 



4.8: Shajaan and Sorensen have shown that an IIR Butterworth filter can be
efficiently designed by implementing the coefficients as “signed-power-of-two” (SPT)
values [83]. The transfer function of a cascade filter with N sections

$$F(z) = \prod_{l=1}^{N} S[l]\, \frac{b[l,0] + b[l,1]z^{-1} + b[l,2]z^{-2}}{a[l,0] + a[l,1]z^{-1} + a[l,2]z^{-2}} \tag{4.21}$$



should be implemented using the second-order sections shown in Fig. 4.11 (p. 158).
A tenth-order filter, as discussed in Example 4.2 (p. 159), can be realized with the
following SPT filter coefficients [83]:

l    S[l]     1/a[l,0]   a[l,1]         a[l,2]
1    2^-1     1          -1 - 2^-4      1 - 2^-…
2    2^-1     2^-1       -1 - 2^-1      1 - 2^-1
3    2^-1     2^-1       -1 - 2^-1      2^-1 + 2^-…
4    1        2^-1       -1 - 2^-2      2^-2 + 2^-…
5    2^-1     2^-1       -1 - 2^-1      2^-2 + 2^-…

We choose b[0] = b[2] = 0.5 and b[1] = 1 because the zeros of the Butterworth filter
are all at z = -1.

(a) Compute and plot the transfer function of the first BiQuad and the complete 
filter. 

(b) Implement and simulate the first BiQuad for 8-bit inputs. 

(c) Build and simulate the 5-stage filter with MaxPlusII. 

(d) Determine LC usage and Registered Performance of the filter. 




5. Multirate Signal Processing 



Introduction 

A frequent task in digital signal processing is to adjust the sampling rate 
according to the signal of interest. Systems with different sampling rates are 
referred to as multirate systems. In this chapter, two typical examples will 
illustrate decimation and interpolation in multirate DSP systems. We will 
then introduce polyphase notation, and will discuss some efficient decimator 
designs. At the end of the chapter we will discuss filter banks and a quite 
new, highly celebrated addition to the DSP toolbox: wavelet analysis. 



5.1 Decimation and Interpolation 

If, after A/D conversion, the signal of interest can be found in a small
frequency band (typically, lowpass or bandpass), then it is reasonable to filter
with a lowpass or bandpass filter and to reduce the sampling rate. A narrow
filter followed by a downsampler is usually referred to as a decimator [67].¹
The filtering, downsampling, and the effect on the spectrum are illustrated in
Fig. 5.1.

We can reduce the sampling rate up to the limit called the “Nyquist rate,”
which says that the sampling rate must be higher than the bandwidth of the
signal, in order to avoid aliasing. Aliasing is demonstrated in Fig. 5.2 for a
lowpass signal. Aliasing is irreparable, and should be avoided at all cost.

For a bandpass signal, the frequency band of interest must fall within an
integer band. If f_s is the sampling rate, and R is the desired downsampling
factor, then the band of interest must fall between

$$k\,\frac{f_s}{2R} < f < (k+1)\,\frac{f_s}{2R}, \qquad k \in \mathbb{N}. \tag{5.1}$$

If it does not, there may be aliasing due to “copies” from the negative
frequency bands, although the sampling rate may still be higher than the
Nyquist rate, as shown in Fig. 5.3.
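The integer-band condition is easy to test before committing to a downsampling factor. The Python sketch below is a hypothetical helper, not taken from the text; it assumes that band k spans k f_s/(2R) to (k+1) f_s/(2R), i.e., one Nyquist zone of the decimated rate f_s/R, and it treats the band edges as allowed.

```python
def integer_band_index(f_lo, f_hi, fs, R):
    """Return the band index k if [f_lo, f_hi] fits inside one integer band.

    Assumption: band k covers k*fs/(2R) ... (k+1)*fs/(2R).
    Returns None when the signal band straddles a band edge.
    """
    width = fs / (2 * R)
    k = int(f_lo // width)
    if f_lo >= k * width and f_hi <= (k + 1) * width:
        return k
    return None


# fs = 8 kHz and R = 4 give integer bands that are 1 kHz wide.
print(integer_band_index(2500, 2900, 8000, 4))   # band 2 (2 kHz to 3 kHz)
print(integer_band_index(2800, 3300, 8000, 4))   # None: crosses 3 kHz
```

In the second call the band crosses an edge, so downsampling by 4 would alias part of the signal onto itself even though the decimated rate exceeds the signal bandwidth.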

Increasing the sampling rate can be useful, in the D/A conversion process,
for example. Typically, D/A converters use a sample-and-hold of first order
at the output, which produces a step-like output function. This can be
compensated for with an analog 1/sinc(x) compensation filter, but most often a
digital solution is more efficient. We can use, in the digital domain, an
expander and an additional filter to get the desired frequency band. We note,
from Fig. 5.4, that the introduced zeros produce an extra copy of the
baseband spectrum that must first be removed before the signal can be processed
with the D/A converter. The much smoother output signal of such an
interpolation² can be seen in Fig. 5.5.

¹ Some authors refer to a downsampler as a decimator.

Fig. 5.1. Decimation of signal x[n] ∘—• X(ω).



5.1.1 Noble Identities 

When manipulating signal flow graphs of multirate systems it is sometimes
useful to rearrange the filter and downsampler/expander, as shown in Fig. 5.6.
These are the so-called “Noble” relations [84]. For the decimator, it follows

$$(\downarrow R)\, F(z) = F(z^R)\, (\downarrow R), \tag{5.2}$$

i.e., if the downsampling is done first, we can reduce the filter length F(z^R)
by a factor of R.

For the interpolator, the Noble relation is defined as

$$F(z)\, (\uparrow R) = (\uparrow R)\, F(z^R), \tag{5.3}$$

i.e., in an interpolation putting the filter before the expander results in an
R-times shorter filter.

² Some authors refer to the expander as an interpolator.

Fig. 5.3. Integer band violation (©1995 VDI Press [4]).
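The decimator identity can be confirmed on a short test sequence. This Python sketch uses hypothetical helper names and integer data; it filters with F(z^R) and then downsamples, and compares the result with downsampling first and filtering with F(z).

```python
def fir(x, f):
    """Causal FIR filter y[n] = sum_k f[k]*x[n-k]; output has len(x) samples."""
    return [sum(f[k] * x[n - k] for k in range(len(f)) if n - k >= 0)
            for n in range(len(x))]


def downsample(x, R):
    """Keep every R-th sample, starting with x[0]."""
    return x[::R]


def stretch(f, R):
    """Coefficients of F(z^R): R-1 zeros inserted between the taps of F(z)."""
    out = []
    for c in f[:-1]:
        out += [c] + [0] * (R - 1)
    return out + [f[-1]]


R = 3
f = [1, -2, 3]
x = list(range(1, 19))

long_filter_first = downsample(fir(x, stretch(f, R)), R)  # F(z^R), then downsample
short_filter_last = fir(downsample(x, R), f)              # downsample, then F(z)
print(long_filter_first == short_filter_last)
```

The right-hand arrangement uses a filter with R times fewer delay elements and runs it at the lower rate, which is the saving the identity promises.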

These two identities will become very useful when we discuss polyphase
implementation in Sect. 5.2 (p. 179).

Fig. 5.6. Equivalent multirate systems (Noble relation).



precisely, we first use an interpolator to increase the sampling rate by R_1,
and then use a decimator to downsample by R_2. Since the filters used for
interpolation and decimation are both lowpass filters, it follows, from the
upper configuration in Fig. 5.7, that we only need to implement the lowpass
filter with the smaller passband frequency, i.e.,

$$f_{R_p} = \min\left(\frac{\pi}{R_1}, \frac{\pi}{R_2}\right). \tag{5.4}$$

This is graphically interpreted in the lower configuration of Fig. 5.7.




Fig. 5.7. Noninteger decimation system, (upper) Cascade of an interpolator and a 
decimator. (lower) Result combining the lowpass filters. 



5.2 Polyphase Decomposition 

Polyphase decomposition is very useful when implementing decimation or
interpolation in IIR or FIR filters and filter banks. To illustrate this, consider
the polyphase decomposition of an FIR decimation filter. If we add
downsampling by a factor of R to the FIR filter structure shown in Fig. 3.1 (p. 110),
we find that we only need to compute the outputs y[n] at time instances

$$y[0], y[R], y[2R], \ldots . \tag{5.5}$$

It follows that we do not need to compute all sums-of-product x[n]f[n - k]
of the convolution. For instance, x[0] only needs to be multiplied by

$$f[0], f[R], f[2R], \ldots . \tag{5.6}$$

Besides x[0], these coefficients only need to be multiplied by

$$x[R], x[2R], \ldots . \tag{5.7}$$

It is therefore reasonable to split the input signal first into R separate
sequences according to

$$x[n] = \sum_{r=0}^{R-1} x_r[n]$$
$$x_0[n] = \{x[0], x[R], \ldots\}$$
$$x_1[n] = \{x[1], x[R+1], \ldots\}$$
$$\vdots$$
$$x_{R-1}[n] = \{x[R-1], x[2R-1], \ldots\}$$

and also to split the filter f[n] into R sequences

$$f[n] = \sum_{r=0}^{R-1} f_r[n]$$
$$f_0[n] = \{f[0], f[R], \ldots\}$$
$$f_1[n] = \{f[1], f[R+1], \ldots\}$$
$$\vdots$$
$$f_{R-1}[n] = \{f[R-1], f[2R-1], \ldots\}.$$

Figure 5.8 shows a decimator filter implemented using polyphase
decomposition. Such a decimator can run R times faster than the usual FIR filter
followed by a downsampler. The filters f_r[n] are called polyphase filters, because
they all have the same magnitude transfer function, but they are separated
by a sample delay, which introduces a phase offset.
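A small numerical model makes the saving concrete. The Python sketch below is an illustration with hypothetical names; as an assumption for causality, branch r is fed with x[mR - r] (the input delayed by r samples and then downsampled) rather than with advanced samples. It checks that the sum of the R short filters equals filtering at the full rate followed by downsampling.

```python
def fir(x, f):
    """Causal FIR filter y[n] = sum_k f[k]*x[n-k]; output has len(x) samples."""
    return [sum(f[k] * x[n - k] for k in range(len(f)) if n - k >= 0)
            for n in range(len(x))]


def polyphase_decimate(x, f, R):
    """Compute y[mR] as the sum over r of (f_r * x_r)[m], all at the low rate."""
    n_out = (len(x) + R - 1) // R
    y = [0] * n_out
    for r in range(R):
        f_r = f[r::R]                       # polyphase filter f_r[n] = f[nR + r]
        # Branch input: x delayed by r samples, then downsampled by R.
        x_r = [x[m * R - r] if m * R - r >= 0 else 0 for m in range(n_out)]
        for m, v in enumerate(fir(x_r, f_r)):
            y[m] += v
    return y


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
f = [124, 214, 57, -33]
print(polyphase_decimate(x, f, 2) == fir(x, f)[::2])
```

Every multiplication in `polyphase_decimate` contributes to an output that is actually kept, whereas the reference computation discards R - 1 of every R filter outputs.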

A final example illustrates the polyphase decomposition. 



Example 5.1: Polyphase Decimator Filter 

Consider a Daubechies length-4 filter with G(z) and R = 2.

$$G(z) = \left((1+\sqrt{3}) + (3+\sqrt{3})z^{-1} + (3-\sqrt{3})z^{-2} + (1-\sqrt{3})z^{-3}\right) \frac{1}{4\sqrt{2}}$$

$$G(z) = 0.48301 + 0.8365z^{-1} + 0.2241z^{-2} - 0.1294z^{-3}.$$

Quantizing the filter to 8 bits of precision results in the following model:

$$G(z) = \left(124 + 214z^{-1} + 57z^{-2} - 33z^{-3}\right)/256$$

$$G(z) = G_0(z^2) + z^{-1}G_1(z^2) = \left(\frac{124}{256} + \frac{57}{256}z^{-2}\right) + z^{-1}\left(\frac{214}{256} - \frac{33}{256}z^{-2}\right),$$

and it follows that

$$G_0(z) = \frac{124}{256} + \frac{57}{256}z^{-1}, \qquad G_1(z) = \frac{214}{256} - \frac{33}{256}z^{-1}.$$

Fig. 5.8. Polyphase realization of a decimation filter.
The following VHDL code³ shows the polyphase implementation for DB4.

PACKAGE n_bits_int IS          -- User defined types
  SUBTYPE BITS8 IS INTEGER RANGE -128 TO 127;
  SUBTYPE BITS9 IS INTEGER RANGE -2**8 TO 2**8-1;
  SUBTYPE BITS17 IS INTEGER RANGE -2**16 TO 2**16-1;
  TYPE ARRAY_BITS17_4 IS ARRAY (0 TO 3) of BITS17;
END n_bits_int;

LIBRARY work;
USE work.n_bits_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;

ENTITY db4poly IS                      ------> Interface
  PORT (clk              : IN  STD_LOGIC;
        x_in             : IN  BITS8;
        clk2             : OUT STD_LOGIC;
        x_e, x_o, g0, g1 : OUT BITS17;
        y_out            : OUT BITS9);
END db4poly;

ARCHITECTURE flex OF db4poly IS

  TYPE STATE_TYPE IS (even, odd);
  SIGNAL state                 : STATE_TYPE;
  SIGNAL x_odd, x_even, x_wait : BITS8;
  SIGNAL clk_div2              : STD_LOGIC;
  -- Arrays for multiplier and taps:
  SIGNAL r                     : ARRAY_BITS17_4;
  SIGNAL x33, x99, x107, y     : BITS17;

BEGIN

  Multiplex: PROCESS  ------> Split into even and odd
  BEGIN               ------> samples at clk rate
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN even =>
        x_even   <= x_in;
        x_odd    <= x_wait;
        clk_div2 <= '1';
        state    <= odd;
      WHEN odd =>
        x_wait   <= x_in;
        clk_div2 <= '0';
        state    <= even;
    END CASE;
  END PROCESS Multiplex;

  -- x33 and x99 are listed so that the chained products settle in simulation
  AddPolyphase: PROCESS (clk_div2, x_odd, x_even, x33, x99)
    VARIABLE m : ARRAY_BITS17_4;
  BEGIN
    -- Compute auxiliary multiplications of the filter
    x33  <= x_odd * 32 + x_odd;
    x99  <= x33 * 2 + x33;
    x107 <= x99 + 8 * x_odd;
    -- Compute all coefficients for the transposed filter
    m(0) := 4 * (32 * x_even - x_even);          -- m[0] = 124
    m(1) := 2 * x107;                            -- m[1] = 214
    m(2) := 8 * (8 * x_even - x_even) + x_even;  -- m[2] = 57
    m(3) := x33;                                 -- m[3] = -33
    ------> Compute the filters and infer registers
    IF clk_div2'event and (clk_div2 = '0') THEN
      ------------ Compute filter G0
      r(0) <= r(2) + m(0);       -- g[0] = 124
      r(2) <= m(2);              -- g[2] = 57
      ------------ Compute filter G1
      r(1) <= -r(3) + m(1);      -- g[1] = 214
      r(3) <= m(3);              -- g[3] = -33
      ------------ Add the polyphase components
      y <= r(0) + r(1);
    END IF;
  END PROCESS AddPolyphase;

  x_e  <= x_even;    -- Provide some test signal as outputs
  x_o  <= x_odd;
  clk2 <= clk_div2;
  g0   <= r(0);
  g1   <= r(1);

  y_out <= y / 256;  -- Connect to output

END flex;

³ The equivalent Verilog code db4poly.v for this example can be found in
Appendix A on page 468.

Fig. 5.9. VHDL simulation of the polyphase implementation of the length-4
Daubechies filter.

The first PROCESS is the FSM, which includes the control flow and the splitting
of the input stream at the sampling rate into even and odd samples. The
second PROCESS, AddPolyphase, includes the reduced adder graph (RAG)
multiplier and hosts the two filters in a transposed structure. Although the
output is scaled, there is potential growth by the amount Σ|g_k| = 1.673 <
2^1. Therefore the output y_out was chosen to have an additional guard bit.
The design uses 208 LCs and runs with 78.74 MHz Registered Performance.
A simulation of the filter is shown in Fig. 5.9. The first four input samples are
a triangle function to demonstrate the splitting into even and odd samples.
Impulses with an amplitude of 100 are used to verify the coefficients of the
two polyphase filters. Note that the filter is not shift invariant! | 5.1 |



From the VHDL simulation shown in Fig. 5.9, it can be seen that such
a decimator is no longer shift invariant: the system is still linear, but it is
no longer time invariant. This can be validated by applying a single impulse.
Initializing at an even-indexed sample, the response is G_0(z), while for an
odd-indexed sample, the response is G_1(z).
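The shift variance can be reproduced with a few lines of Python. This is a behavioral sketch with hypothetical names, not a model of the VHDL timing: it applies the quantized taps 124, 214, 57, -33 (scaled by 256), keeps the even-indexed outputs, and feeds an impulse of amplitude 100 once on an even and once on an odd sample.

```python
G = [124, 214, 57, -33]        # quantized DB4 taps, scaled by 256


def decimate_by_2(x, g=G):
    """Filter x with g, then keep only the even-indexed output samples."""
    full = [sum(g[k] * x[n - k] for k in range(len(g)) if n - k >= 0)
            for n in range(len(x))]
    return full[::2]


even_impulse = [100, 0, 0, 0, 0, 0]   # impulse on an even-indexed sample
odd_impulse = [0, 100, 0, 0, 0, 0]    # same impulse, shifted by one sample

print(decimate_by_2(even_impulse))    # [12400, 5700, 0]   -> taps of G0
print(decimate_by_2(odd_impulse))     # [0, 21400, -3300]  -> taps of G1
```

Shifting the input by one sample does not shift the output; it selects the other polyphase filter, which is exactly what the two impulses in the simulation are meant to show.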

5.2.1 Recursive IIR Decimator 

It is also possible to apply polyphase decomposition to recursive filters and
to get the speed benefit, if we follow the idea from Martinez and Parks [81],
in the transfer function

$$F(z) = \frac{\sum_{l=0}^{L} a[l] z^{-l}}{1 - \sum_{l=1}^{K} b[l] z^{-lR}}, \tag{5.9}$$

i.e., the recursive part has only each Rth coefficient. We have already discussed
such a design in the context of IIR filters (Fig. 4.17, p. 167). Figure 5.10 shows
that, depending on the transition width ΔF of the filter, an IIR decimator
offers substantial savings compared with an FIR decimator.

Fig. 5.10. Comparison of computational effort for decimators, ΔF = f_p - f_s.



5.2.2 Fast-running FIR Filter 

An interesting application of polyphase decomposition is the so-called
fast-running FIR filter. The basic idea of this filter is the following: If we
decompose the input signal x[n] into R polyphase components, we can use
Winograd's short convolution algorithms to implement a fast filter. Let us
demonstrate this with an example for R = 2.

Example 5.2: Fast-Running FIR Filter

We decompose the input signal X(z) and filter F(z) into even and odd
polyphase components, i.e.,

$$X(z) = \sum_n x[n] z^{-n} = X_0(z^2) + z^{-1} X_1(z^2) \tag{5.10}$$

$$F(z) = \sum_n f[n] z^{-n} = F_0(z^2) + z^{-1} F_1(z^2). \tag{5.11}$$

The convolution in the time domain of x[n] and f[n] yields a polynomial
multiply in the z-domain. It follows for the output signal Y(z) that

$$Y(z) = Y_0(z^2) + z^{-1} Y_1(z^2) \tag{5.12}$$

$$\phantom{Y(z)} = \left(X_0(z^2) + z^{-1} X_1(z^2)\right)\left(F_0(z^2) + z^{-1} F_1(z^2)\right). \tag{5.13}$$

If we split (5.13) into the polyphase components Y_0(z) and Y_1(z) we get

$$Y_0(z) = X_0(z) F_0(z) + z^{-1} X_1(z) F_1(z) \tag{5.14}$$

$$Y_1(z) = X_1(z) F_0(z) + X_0(z) F_1(z). \tag{5.15}$$

If we now compare (5.13) with a 2 x 2 linear convolution

$$A(z) \times B(z) = (a[0] + z^{-1} a[1])(b[0] + z^{-1} b[1]) \tag{5.16}$$

$$\phantom{A(z) \times B(z)} = a[0]b[0] + z^{-1}(a[0]b[1] + a[1]b[0]) + a[1]b[1] z^{-2}, \tag{5.17}$$

we notice that the factors for z^-1 are the same, but for Y_0(z) we must compute
an extra addition to get the right phase relation. Winograd [85] has compiled
a list of short convolution algorithms, and a linear 2 x 2 convolution can be
computed using three multiplications and six adds with

$$a[0] = x[0] - x[1] \qquad a[1] = x[0] \qquad a[2] = x[1] - x[0]$$
$$b[0] = f[0] - f[1] \qquad b[1] = f[0] \qquad b[2] = f[1] - f[0]$$
$$c[k] = a[k] b[k] \qquad k = 0, 1, 2$$
$$y[0] = c[1] + c[2] \qquad y[1] = c[1] - c[0]. \tag{5.18}$$

With the help of this short convolution algorithm, we can now define the
fast-running filter as follows:

$$\begin{bmatrix} Y_0 \\ Y_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & z^{-1} \\ -1 & 1 & -1 \end{bmatrix} \begin{bmatrix} F_0 & 0 & 0 \\ 0 & F_0 + F_1 & 0 \\ 0 & 0 & F_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ X_1 \end{bmatrix}. \tag{5.19}$$

Figure 5.11 shows the graphical interpretation. | 5.2 |

Fig. 5.11. Fast-running FIR filter with R = 2.

If we compare the direct filter implementation with the fast-running FIR
filter we must distinguish between hardware effort and average number of
adder and multiplier operations. A direct implementation would have L
multipliers and L - 1 adders running at full speed. For the fast-running filter we
have three filters of length L/2 running at half speed. This results in 3L/4
multiplications per output sample and (2 + 2)/2 + 3/2(L/2 - 1) = 3L/4 + 1/2
additions for the whole filter, i.e., the arithmetic count is about 25% better
than in the direct implementation. From an implementation standpoint, we
need 3L/2 multipliers and 4 + 3(L/2 - 1) = 3L/2 + 1 adders, i.e., the effort is
about 50% higher than in the direct implementation. The important feature
in Fig. 5.11 is that the fast-running filter basically runs at twice the speed
of the direct implementation. Using a higher number R of decomposition
may further increase the maximum throughput. The general methodology for
R polyphase signals with f_a as input rate is now as follows:

Algorithm 5.3: Fast-Running FIR Filter 

1) Decompose the input signal into R polyphase signals, using A_e adders
to form R sequences at a rate of f_a/R.

2) Filter the R sequences with R filters of length L/R.

3) Use A_a additions to compute the polyphase representation of the
output Y_k(z). Use a final output multiplexer to generate the output
signal Y(z).
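For R = 2 the three steps can be sketched in Python as follows. The function names are hypothetical, and the structure assumed here uses the three half-length filters F_0, F_0 + F_1, and F_1, with one pre-addition and three post-additions; the result is compared against a direct FIR filter.

```python
def fir(x, f):
    """Causal FIR filter y[n] = sum_k f[k]*x[n-k]; output has len(x) samples."""
    return [sum(f[k] * x[n - k] for k in range(len(f)) if n - k >= 0)
            for n in range(len(x))]


def fast_running_fir(x, f):
    """Two-phase fast-running FIR filter built from three half-length filters."""
    assert len(x) % 2 == 0 and len(f) % 2 == 0
    # Step 1: polyphase split of the input (and of the coefficients).
    x0, x1 = x[0::2], x[1::2]
    f0, f1 = f[0::2], f[1::2]
    x01 = [p + q for p, q in zip(x0, x1)]        # pre-addition X0 + X1
    f01 = [p + q for p, q in zip(f0, f1)]        # precomputed F0 + F1
    # Step 2: three filters of length L/2, all at half the input rate.
    p0, p1, p2 = fir(x0, f0), fir(x01, f01), fir(x1, f1)
    # Step 3: post-additions and output multiplexer.
    p2_delayed = [0] + p2[:-1]                   # z^-1 at the half rate
    y0 = [a + b for a, b in zip(p0, p2_delayed)]         # X0*F0 + z^-1 X1*F1
    y1 = [b - a - c for a, b, c in zip(p0, p1, p2)]      # X0*F1 + X1*F0
    y = [0] * len(x)
    y[0::2], y[1::2] = y0, y1
    return y


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 5, -8]
f = [124, 214, 57, -33]
print(fast_running_fir(x, f) == fir(x, f))
```

A direct two-phase implementation would need four half-length filters (X_0F_0, X_1F_1, X_0F_1, X_1F_0); trading one of them for a handful of additions is where the 25% saving in arithmetic comes from.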

Note that the computed partial filter of length L/R may again be decom- 
posed, using Algorithm 5.3. Then the question arises: When should we stop 
the iterative decomposition? Mou and Duhamel [86] have compiled a table 
with the goal of minimizing the average arithmetic count. Table 5.1 shows 
the optimal decomposition. The criterion used was a minimum total number 
of multiplications and additions, which is typical for a MAC-based design. In 
Table 5.1, all partial filters that should be implemented based on Algorithm 
5.3 are underlined. 

For a larger length than 60, a fast convolution using the FFT is more 
efficient, and will be discussed in Chap. 6. 



Table 5.1. Computational effort for the recursive FIR decomposition [86, 87].

5.3 Hogenauer CIC Filters

A very efficient architecture for a high decimation-rate filter is the “cascade
integrator comb” (CIC) filter introduced by Hogenauer [88]. The CIC (also
known as the Hogenauer filter) has proven to be an effective element in
high-decimation or interpolation systems. One application is in wireless
communications, where signals, sampled at RF or IF rates, need to be reduced to
baseband. For narrowband applications (e.g., cellular radio), decimation rates
in excess of 1000 are routinely required. Such systems are sometimes referred
to as channelizers [89]. Another application area is in ΣΔ data converters
[90].

CIC filters are based on the fact that perfect pole/zero canceling can
be achieved. This is only possible with exact integer arithmetic. Both two's
complement and the residue number system have the ability to support
error-free arithmetic. In the case of two's complement, arithmetic is performed
modulo 2^b, and, in the case of the RNS, modulo M.

An introductory case study will be used to demonstrate. 



5.3.1 Single-Stage CIC Case Study 

Figure 5.12 shows a first-order CIC filter without decimation in 4-bit
arithmetic. The filter consists of a (recursive) integrator (I-section), followed by a
4-bit differentiator or comb (C-section). The filter is realized with 4-bit
values, which are implemented in two's complement arithmetic, and the values
are bounded by -8_10 = 1000_2C and 7_10 = 0111_2C.

Figure 5.13 shows the impulse response of the filter. Although the filter is
recursive, the impulse response is finite, i.e., it is a recursive FIR filter. This
is unusual because we generally expect a recursive filter to be an IIR filter.
The impulse response shows that the filter computes the sum

$$y[n] = \sum_{k=0}^{D-1} x[n-k], \tag{5.20}$$

where D is the delay found in the comb section. The filter's response is a
moving average defined over D contiguous sample values. Such a moving
average is a very simple form of a lowpass filter. The same moving-average
filter, implemented as a nonrecursive FIR filter, would require D - 1 = 5
adders, compared with one adder and one subtractor for the CIC design.

Fig. 5.12. Moving average in 4-bit arithmetic.

Fig. 5.13. Impulse response of the filter from Fig. 5.12.

A recursive filter having a known pole location has its largest steady-state
sinusoidal output when the input is an “eigenfrequency” signal, one whose
pole directly coincides with a pole of the recursive filter. For the CIC section,
the eigenfrequency corresponds to the frequency ω = 0, i.e., a step input.
The step response of the first-order moving average given by (5.20) is a ramp
for the first D samples, and a constant y[n] = D = 6 thereafter, as shown in
Fig. 5.14. Note that although the integrator w[n] shows frequent overflows,
the output is still correct. This is because the comb subtraction also uses two's
complement arithmetic, e.g., at the time of the first wrap-around, the actual
integrator signal is w[n] = -8_10 = 1000_2C and the delay signal is w[n - 6] =
2_10 = 0010_2C. This results in y[n] = -8_10 - 2_10 = 1000_2C - 0010_2C =
0110_2C = 6_10, as expected. The accumulator would continue to count upward
until w[n] = -8_10 = 1000_2C is again reached. This pattern would continue as
long as the step input is present. In fact, as long as the output y[n] is a valid
4-bit two's complement number in the range [-8, 7], the exact arithmetic of the
two's complement system will automatically compensate for the integrator
overflows.
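The self-correcting wrap-around is easy to reproduce in software. The following Python sketch (hypothetical names; a behavioral model only, not the hardware) runs a unit step through a 4-bit integrator and a comb with D = 6, wrapping every result into the range [-8, 7].

```python
def wrap4(v):
    """Wrap an integer into the 4-bit two's complement range [-8, 7]."""
    return (v + 8) % 16 - 8


def cic_step_response(n_samples, D=6, step=1):
    """Return (integrator values, comb outputs) for a constant input."""
    w_hist = [0] * D            # w[n-1], ..., w[n-D]
    w = 0
    integ, out = [], []
    for _ in range(n_samples):
        w = wrap4(w + step)                 # integrator, overflows regularly
        y = wrap4(w - w_hist[-1])           # comb: w[n] - w[n-D], also wrapped
        w_hist = [w] + w_hist[:-1]
        integ.append(w)
        out.append(y)
    return integ, out


integ, out = cic_step_response(40)
print(integ[:10])   # [1, 2, 3, 4, 5, 6, 7, -8, -7, -6]
print(out[:10])     # [1, 2, 3, 4, 5, 6, 6, 6, 6, 6]
```

At the eighth sample the integrator wraps to -8 while the delayed value is 2, and the wrapped difference is again 6, matching the worked example above.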

In general, a 4-bit filter width is usually much too small for a typical
application. The Harris IC HSP43220, for instance, has five stages and uses a
66-bit integrator width. To reduce the adder latency, it is therefore reasonable
to use a multibase RNS system. If we use, for instance, the set Z_30 = {2, 3, 5},
it can be seen from Table 5.2 that a total of 2 x 3 x 5 = 30 unique values
can be represented. The mapping is unique (bijective) and is proven by the
Chinese remainder theorem.

Figure 5.15 displays the step response of the illustrated RNS
implementation. The filter's output, y[n], has been reconstructed using data from Table
5.2. The output response is identical with the sample value obtained in the
two's complement case (see Fig. 5.14). A mapping that preserves the structure
is called a homomorphism. A bijective homomorphism is called an
isomorphism (notation ≅), which can be expressed as:

$$\mathbb{Z}_{30} \cong \mathbb{Z}_2 \times \mathbb{Z}_3 \times \mathbb{Z}_5. \tag{5.21}$$

Table 5.2. RNS mapping for the set (2,3,5).

Fig. 5.14. Step response (eigenfrequency test) of the filter from Fig. 5.12.
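The same step-response experiment can be repeated with three independent residue channels. This Python sketch is illustrative only: the names are hypothetical, and the reconstruction is done by a brute-force search over the 30 representable values instead of a hardware-style CRT or mixed radix conversion.

```python
MODULI = (2, 3, 5)


def to_rns(v):
    """Map an integer to its residues modulo 2, 3, and 5."""
    return tuple(v % m for m in MODULI)


def from_rns(residues):
    """Invert the mapping by searching the 30 values 0..29 (table lookup)."""
    for v in range(30):
        if to_rns(v) == tuple(residues):
            return v
    raise ValueError("residues do not belong to the set (2, 3, 5)")


def rns_cic_step_response(n_samples, D=6, step=1):
    """Run integrator and comb separately in each small modulus."""
    channels = []
    for m in MODULI:
        w, hist, ys = 0, [0] * D, []
        for _ in range(n_samples):
            w = (w + step) % m                  # short-wordlength integrator
            ys.append((w - hist[-1]) % m)       # comb in the same modulus
            hist = [w] + hist[:-1]
        channels.append(ys)
    return [from_rns(r) for r in zip(*channels)]


print(rns_cic_step_response(12))   # [1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6]
```

Each channel needs at most three bits, so its adder carry chain is far shorter than that of a single wide two's complement integrator, while the reconstructed output is unchanged.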



5.3.2 Multistage CIC Filter Theory 



The transfer function of a general CIC system consisting of S stages is given
by:

$$F(z) = \left(\frac{1 - z^{-RD}}{1 - z^{-1}}\right)^{S}, \tag{5.22}$$

where D is the number of delays in the comb section, and R the
downsampling (decimation) factor.

It can be seen from (5.22) that F(z) is defined with respect to RDS zeros
and S poles. The RD zeros generated by the numerator term (1 - z^-RD)
are located on 2π/(RD)-radian centers beginning at z = 1. Each distinct
zero appears with multiplicity S. The S poles of F(z) are located at z = 1,
i.e., at the zero frequency (DC) location. It can immediately be seen that
they are annihilated by S zeros of the CIC filter. The result is an S-stage
moving average filter. The maximum dynamic range growth occurs at the
DC frequency (i.e., z = 1). The maximum dynamic range growth is

$$B_{\text{grow}} = (RD)^{S} \quad \text{or} \quad b_{\text{grow}} = \log_2(B_{\text{grow}}) \text{ bits}. \tag{5.23}$$

Knowledge of this value is important when designing a CIC filter, because
exact arithmetic is required, as shown in the single-stage CIC case study. In
practice, the worst-case gain can be substantial, as evidenced by a 66-bit dynamic
range built into commercial CIC filters (e.g., the Harris HSP43220 [89]
channelizer), typically designed using two's complement arithmetic.

Fig. 5.16. CIC filter. Each stage 26-bit.

Figure 5.16 shows a three-stage CIC filter that consists of a three-stage
integrator, a sampling rate reduction by R, and a three-stage comb. Note
that all integrators are implemented first, then the decimator, and finally the
comb sections. The rearrangement saves a factor R of delay elements in the
comb sections. The number of delays D for a high-decimation rate filter is
typically one or two.

A three-stage CIC filter with an input wordwidth of eight bits, along with
D = 2, R = 32, or DR = 2 x 32 = 64, would require an internal wordwidth
of W = 8 + 3 log_2(64) = 26 bits to ensure that run-time overflow would not
occur. The output wordwidth would normally be a value significantly less
than W, such as 10 bits.
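The wordwidth rule and the benefit of wrap-around arithmetic can both be checked with a short behavioral model. The Python sketch below uses hypothetical names and makes two simplifying assumptions: the integrators are cascaded without pipeline registers, and all internal values are wrapped to W bits.

```python
import math


def cic_word_width(b_in, S, R, D):
    """Internal width W = b_in + S*log2(R*D), rounded up to whole bits."""
    return b_in + math.ceil(S * math.log2(R * D))


def wrap(v, bits):
    """Wrap an integer into the two's complement range of the given width."""
    half = 1 << (bits - 1)
    return (v + half) % (1 << bits) - half


def cic_decimate(x, S=3, R=32, D=2, bits=26):
    """S integrators at the input rate, downsample by R, S combs of delay D."""
    integ = [0] * S
    comb_delay = [[0] * D for _ in range(S)]
    y = []
    for n, sample in enumerate(x):
        v = sample
        for s in range(S):                      # integrators, may overflow
            integ[s] = wrap(integ[s] + v, bits)
            v = integ[s]
        if n % R == R - 1:                      # keep every R-th sample
            for s in range(S):                  # combs at the low rate
                delayed = comb_delay[s][-1]
                comb_delay[s] = [v] + comb_delay[s][:-1]
                v = wrap(v - delayed, bits)
            y.append(v)
    return y


W = cic_word_width(8, 3, 32, 2)
y = cic_decimate([127] * 320, bits=W)
print(W)                 # 26
print(y[-1])             # 33292288 = 127 * 64**3
print(y[-1] // 2**16)    # 508, the value after scaling by 2**16
```

With a step of amplitude 127 the steady-state output is 127 x 64^3, which just fits into 26 bits, so the final value is exact even though the integrators wrap around many times on the way.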

Example 5.4: Three-Stage CIC Decimator I 

The worst-case gain condition can be forced by supplying a step (DC) signal
to the CIC filter. Fig. 5.17a shows a step input signal with amplitude 127.
Fig. 5.17b displays the output found at the third integrator section. Observe
that run-time overflows occur at a regular rate. The CIC output shown in
Fig. 5.17c is interpolated (smoothed) for display at the input sampling rate.
The output shown in Fig. 5.17d is scaled to 10-bit precision and displayed at
the decimated sample rate.

The following VHDL code⁵ shows the CIC example design.

PACKAGE n_bit_int IS    -- User defined types
  SUBTYPE word26 IS INTEGER RANGE -2**25 TO 2**25-1;
END n_bit_int;

LIBRARY work;
USE work.n_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;

ENTITY cic3r32 IS
  PORT ( clk   : IN  STD_LOGIC;
         x_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
         clk2  : OUT STD_LOGIC;
         y_out : OUT STD_LOGIC_VECTOR(9 DOWNTO 0));
END cic3r32;

ARCHITECTURE flex OF cic3r32 IS

  TYPE STATE_TYPE IS (hold, sample);
  SIGNAL state : STATE_TYPE;
  SIGNAL count : INTEGER RANGE 0 TO 31;
  SIGNAL x     : STD_LOGIC_VECTOR(7 DOWNTO 0);   -- Registered input
  SIGNAL sxtx  : STD_LOGIC_VECTOR(25 DOWNTO 0);  -- Sign extended input
  SIGNAL i0, i1, i2 : word26;                    -- I section 0, 1, and 2
  SIGNAL i2d1, i2d2, i2d3, i2d4, c1, c0 : word26;
                                                 -- I and COMB section 0
  SIGNAL c1d1, c1d2, c1d3, c1d4, c2 : word26;    -- COMB 1
  SIGNAL c2d1, c2d2, c2d3, c2d4, c3 : word26;    -- COMB 2

BEGIN

  FSM: PROCESS              -- Clock divider for the comb section
  BEGIN
    WAIT UNTIL clk = '1';
    IF count = 31 THEN
      count <= 0;
      state <= sample;
      clk2  <= '1';
    ELSE
      count <= count + 1;
      state <= hold;
      clk2  <= '0';
    END IF;
  END PROCESS FSM;

  sxt: PROCESS (x)          -- Sign extension of the 8-bit input to 26 bits
  BEGIN
    sxtx(7 DOWNTO 0) <= x;
    FOR k IN 25 DOWNTO 8 LOOP
      sxtx(k) <= x(x'high);
    END LOOP;
  END PROCESS sxt;

  Int: PROCESS              -- Three integrators at the input rate
  BEGIN
    WAIT UNTIL clk = '1';
    x  <= x_in;
    i0 <= i0 + CONV_INTEGER(sxtx);
    i1 <= i1 + i0;
    i2 <= i2 + i1;
  END PROCESS Int;

  Comb: PROCESS             -- Three combs, each with a delay of two samples
  BEGIN
    WAIT UNTIL clk = '1';
    IF state = sample THEN
      c0   <= i2;
      i2d1 <= c0;
      i2d2 <= i2d1;
      c1   <= c0 - i2d2;
      c1d1 <= c1;
      c1d2 <= c1d1;
      c2   <= c1 - c1d2;
      c2d1 <= c2;
      c2d2 <= c2d1;
      c3   <= c2 - c2d2;
    END IF;
  END PROCESS Comb;

  y_out <= CONV_STD_LOGIC_VECTOR(c3 / 2**16, 10);

END flex;

⁵ The equivalent Verilog code cic3r32.v for this example can be found in
Appendix A on page 465.

The designed filter includes a finite state machine (FSM); a sign extension, 
sxt: PROCESS, and two “arithmetic” PROCESS blocks. The FSM : PROCESS con- 
tains the clock divider for the comb section. The Int : PROCESS realizes the 
three integrators. The Comb: PROCESS includes the three comb filters, each 
having a delay of two samples. The filter runs at 40.0 MHz and uses 401 LCs. 
Note that the filter would most likely not fit in the target device without 
the early downsampling. The early downsampling saves 3 × 32 × 26 = 2496 
registers or LCs! 
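
The tolerance of the integrators to run-time overflow, and the DC gain behind the final scaling by 2**16, can be checked with a short behavioral model. The following Python sketch is an illustration only: it is not cycle-accurate with respect to the registers of the VHDL design, the helper names `wrap` and `cic3_decimate` are hypothetical, and the arithmetic right shift stands in for the VHDL division (the two agree for nonnegative values).

```python
def wrap(value, bits):
    """Wrap an integer into the two's complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def cic3_decimate(samples, rate=32, delay=2, bits=26, shift=16):
    """Behavioral three-stage CIC decimator with wrap-around arithmetic."""
    i0 = i1 = i2 = 0
    comb_history = [[0] * delay for _ in range(3)]  # one delay line per comb
    outputs = []
    for n, x in enumerate(samples, start=1):
        # Three integrators at the input rate; overflow simply wraps around.
        i0 = wrap(i0 + x, bits)
        i1 = wrap(i1 + i0, bits)
        i2 = wrap(i2 + i1, bits)
        if n % rate == 0:
            # Three combs at the decimated rate, each y[m] = x[m] - x[m - delay].
            c = i2
            for history in comb_history:
                delayed = history.pop(0)
                history.append(c)
                c = wrap(c - delayed, bits)
            outputs.append(c >> shift)  # scale the 26-bit result to 10 bits
    return outputs


step_response = cic3_decimate([127] * 640)
print(step_response[-1])  # 508 = 127 * 64**3 / 2**16
```

Although the third integrator overflows repeatedly for this step input, the comb sections recover the exact result, because the true output value 127 × 64³ still fits into the 26-bit range.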

If we compare the filter outputs (Fig. 5.18 shows the VHDL output y_out, 
and Fig. 5.17d shows the response y[n] from the MATLAB simulation), we 
see that the filter behaves as expected. | 5.4 | 

Fig. 5.18. VHDL simulation of the three-stage CIC filter shown in Fig. 5.16. 



Hogenauer [88] noted, based on a careful analysis, that some of the least 
significant bits from early stages can be eliminated without sacrificing system 
integrity. Figure 5.19 displays the system's magnitude frequency response 
for a design using full (worst-case) wordwidth in all stages, and using the 
wordlength “pruning” policy suggested by Hogenauer. 



5.3.3 Amplitude and Aliasing Distortion 

The transfer function of an $S$-stage CIC filter was reported to be 

$$F(z) = \left( \frac{1 - z^{-RD}}{1 - z^{-1}} \right)^{S}. \tag{5.24}$$

Fig. 5.19. CIC transfer function ($f_s$ is sampling frequency at the input). 

Fig. 5.20. Transfer function of a three-stage CIC decimator. Note that $f_s$ is the 
sampling frequency at the lower rate. 



The amplitude distortion and the maximum aliasing component can be 
computed in the frequency domain by evaluating $F(z)$ along the arc $z = e^{j 2\pi f T}$. 
The magnitude response becomes 

$$|F(f)| = \left| \frac{\sin(2\pi f T R D / 2)}{\sin(2\pi f T / 2)} \right|^{S}, \tag{5.25}$$



which can be used to directly compute the amplitude distortion at the 
passband edge $\omega_p$. Figure 5.20 shows $|F(f - k\frac{1}{2R})|$ for a three-stage CIC filter 
with R = 3, D = 2, and RD = 6. Observe that several copies of the CIC 
filter's low-frequency response are aliased in the baseband. 

It can be seen that the maximum aliasing component can be computed 
from $|F(f)|$ at the frequency 

$$f\,\big|_{\text{Aliasing has maximum}} = \frac{1}{2R} - f_p. \tag{5.26}$$

Most often, only the first aliasing component is taken into consideration, 
because the second component is smaller. Figure 5.21 shows the amplitude 
distortion at $f_p$ for different ratios of $f_p/(D f_s)$. 

Figure 5.22 shows, for different values of S, R, and D, the maximum 
aliasing component for a special ratio of passband frequency and sampling 
frequency, $f_p/f_s$. 
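
As a numerical illustration of (5.25), the following Python sketch evaluates the normalized magnitude response at the passband edge and at the lower edge of the first aliasing band. It rests on assumptions that are not stated in the text: frequencies are expressed in units of the input sampling rate, the passband edge is taken as $f_p = 1/(PR)$ with a passband ratio P = 8, and the aliasing band edge is taken as $1/R - f_p$ in these units. The function names are hypothetical.

```python
import math


def cic_magnitude(f, stages=3, rate=32, delay=2):
    """Normalized |F(f)| after (5.25); f is relative to the input sampling rate."""
    rd = rate * delay
    if f == 0.0:
        return 1.0  # DC gain after normalization by (RD)**S
    ratio = math.sin(math.pi * f * rd) / (rd * math.sin(math.pi * f))
    return abs(ratio) ** stages


def attenuation_db(gain):
    """Express a gain below one as a positive attenuation in dB."""
    return -20.0 * math.log10(gain)


rate, passband_ratio = 32, 8           # assumed values: R = 32, P = 8
f_p = 1.0 / (passband_ratio * rate)    # passband edge at the input rate
distortion_db = attenuation_db(cic_magnitude(f_p))
aliasing_db = attenuation_db(cic_magnitude(1.0 / rate - f_p))
print(f"{distortion_db:.2f} dB, {aliasing_db:.2f} dB")
```

Under these assumptions the sketch yields about 2.74 dB of passband droop and about 53.41 dB of aliasing suppression for S = 3, R = 32, and D = 2, the same figures that the program cic.exe reports in Example 5.5.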

It may be argued that the amplitude distortion can be corrected with a 
cascaded FIR compensation filter, which has a transfer function $1/|F(z)|$ in 
the passband, but the aliasing distortion cannot be repaired. Therefore, the 
acceptable aliasing distortion is most often the dominant design parameter. 

Fig. 5.21. Amplitude distortion for the CIC decimator. 



5.3.4 Hogenauer Pruning Theory 



The total internal wordwidth is defined as the sum of the input wordwidth 
and the maximum dynamic growth requirement (5.23), or algebraically: 

$$B_{\text{intern}} = B_{\text{input}} + B_{\text{growth}}. \tag{5.27}$$

If the CIC filter is designed to perform exact arithmetic with this wordwidth 
at all levels, no run-time overflow will occur at the output. In general, input 
and output bit width of a CIC filter are in the same range. We find then 
that quantization introduced through pruning in the output is, in general, 
larger than quantization introduced by also pruning some LSBs at previous 
stages. If $\sigma^2_{T,2S+1}$ is the quantization noise introduced through pruning in 
the output, Hogenauer suggested to set it equal to the sum of the noise $\sigma^2_k$ 
introduced by all previous sections. For a CIC filter with S integrator and S 
comb sections, it follows that: 



$$\sum_{k=1}^{2S} \sigma^2_{T,k} = \sum_{k=1}^{2S} \sigma^2_k P_k^2 \le \sigma^2_{T,2S+1} \tag{5.28}$$

$$\sigma^2_{T,k} = \frac{1}{2S}\, \sigma^2_{T,2S+1} \tag{5.29}$$

$$P_k^2 = \sum_{n} \left( h_k[n] \right)^2, \quad k = 1, 2, \ldots, 2S, \tag{5.30}$$

where $P_k^2$ is the power gain from stage k to the output. Compute next the 
number of bits $B_k$, which should be pruned by 

$$B_k = \left\lfloor 0.5 \log_2 \left( P_k^{-2} \times \frac{6}{S} \times \sigma^2_{T,2S+1} \right) \right\rfloor \tag{5.31}$$

$$\sigma^2_{T,k} \Big|_{k=2S+1} = \frac{1}{12}\, 2^{2 B_k} = \frac{1}{12}\, 2^{2 (B_{\text{in}} - B_{\text{out}} + B_{\text{growth}})}. \tag{5.32}$$

Fig. 5.22. Maximum aliasing for one- to four-stage CIC decimator. 

The power gain $P_k^2$, $k = S+1, \ldots, 2S$, for the comb sections can be computed 
using the binomial coefficient 

$$H_k(z) = \sum_{n=0}^{2S+1-k} (-1)^n \binom{2S+1-k}{n} z^{-nD}, \quad k = S+1, \ldots, 2S. \tag{5.33}$$

For computation of the first factor $P_k^2$ for $k = 1, 2, \ldots, S$, it is useful to keep 
in mind that each integrator/comb pair produces a finite (moving average) 
impulse response. The resulting system for stage k is therefore a series of 
$S - k + 1$ integrator/comb pairs followed by $k - 1$ comb sections. Figure 5.23 
shows this rearrangement for a simplified computation of $P_k^2$. 

Fig. 5.23. Rearrangement to simplify the computation of $P_k^2$ (©1995 VDI Press [4]). 

The program cic.exe (included on the CD-ROM under book2e/util) 
computes this CIC pruning. The program produces the impulse response 
cicXX.imp and a configuration file cicXX.dat, where XX must be specified. 
The following design example explains the results. 

Example 5.5: Three-Stage CIC Decimator II 

Let us design the same overall CIC filter as in Example 5.4 (p. 191) but this time 
with bit pruning. The raw data of the decimator were: $B_{\text{input}} = 8$, $B_{\text{output}} = 10$ bit, 
$R = 32$, and $D = 2$. Obviously, the bit growth is 

$$B_{\text{growth}} = \lceil \log_2 (RD)^S \rceil = \lceil \log_2 (64^3) \rceil = 3 \times 6 = 18, \tag{5.34}$$

and the total internal bit width becomes 

$$B_{\text{intern}} = B_{\text{input}} + B_{\text{growth}} = 8 + 18 = 26. \tag{5.35}$$

The program cic.exe shows the following results: 



-- Program for the design of a CIC decimator.
--   Input bit width          Bin  =  8
--   Output bit width         Bout = 10
--   Number of stages         S    =  3
--   Decimation factor        R    = 32
--   COMB delay               D    =  2
--   Frequency resolution     DR   = 64
--   Passband freq. ratio     P    =  8

-- Results of the Design

-- Computed bit width:
--   Maximum bit growth over all stages      = 18
--   Maximum bit width including sign Bmax+1 = 26
--   Stage 1 INTEGRATOR. Bit width : 26
--   Stage 2 INTEGRATOR. Bit width : 21
--   Stage 3 INTEGRATOR. Bit width : 16
--   Stage 1 COMB.       Bit width : 14
--   Stage 2 COMB.       Bit width : 13
--   Stage 3 COMB.       Bit width : 12
--   Maximum aliasing component : 0.002135 = 53.41 dB
--   Amplitude distortion       : 0.729769 = 2.74 dB

| 5.5 | 
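
The pruning rule (5.28)-(5.33) can be retraced with a few lines of Python. The sketch below is not the source of cic.exe; it is an independent illustration under the assumption that the impulse responses for the integrator stages follow the rearrangement of Fig. 5.23 (boxcar filters of length RD followed by the remaining comb sections), and that a negative $B_k$ is clamped to zero. All function names are hypothetical.

```python
import math


def convolve(a, b):
    """Plain polynomial multiplication of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
    return out


def power_gain(stage, stages, rate, delay):
    """P_k^2 of (5.30): sum of squared impulse-response samples to the output."""
    rd = rate * delay
    if stage <= stages:
        # (S - k + 1) integrator/comb pairs, each a boxcar of length RD,
        # followed by (k - 1) comb sections 1 - z^-RD at the input rate.
        response = [1]
        for _ in range(stages - stage + 1):
            response = convolve(response, [1] * rd)
        comb = [1] + [0] * (rd - 1) + [-1]
        for _ in range(stage - 1):
            response = convolve(response, comb)
    else:
        # Remaining comb sections only: signed binomial coefficients (5.33).
        order = 2 * stages + 1 - stage
        response = [(-1) ** n * math.comb(order, n) for n in range(order + 1)]
    return sum(c * c for c in response)


def pruned_widths(b_in=8, b_out=10, stages=3, rate=32, delay=2):
    """Bit width of every integrator and comb stage after pruning."""
    b_growth = math.ceil(stages * math.log2(rate * delay))       # (5.34)
    b_intern = b_in + b_growth                                   # (5.35)
    variance_out = 2.0 ** (2 * (b_intern - b_out)) / 12.0        # (5.32)
    widths = []
    for k in range(1, 2 * stages + 1):
        p2 = power_gain(k, stages, rate, delay)
        b_k = math.floor(0.5 * math.log2(6.0 / stages * variance_out / p2))  # (5.31)
        widths.append(b_intern - max(b_k, 0))
    return widths


print(pruned_widths())  # [26, 21, 16, 14, 13, 12]
```

For the data of Example 5.5 the sketch reproduces the stage widths 26, 21, 16, 14, 13, and 12 bits listed in the program output.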



The design charts shown in Figs. 5.21 and 5.22 may also be used to 
compute the maximum aliasing component and the amplitude distortion. If we 
compare this data with the tables provided by Hogenauer, then the aliasing 
suppression is 53.4 dB (for Delay = 2 [88, Table II]), and the passband 
attenuation is 2.74 dB [88, Table I]. Note that Table I provided by Hogenauer 
is normalized with the comb delay, while the program cic.exe does not 
normalize with the comb delay. 

The following design example demonstrates the detailed bit-width design, 
using MaxPlusII. 

Example 5.6: Three-Stage CIC Decimator III 

The data for the design should be the same as for Example 5.4 (p. 191), but 
we now consider the pruning as computed in Example 5.5 (p. 198). 

The following VHDL code 6 shows the CIC example design with pruning. 

PACKAGE n_bit_int IS               -- User defined types
  SUBTYPE word26 IS INTEGER RANGE -2**25 TO 2**25-1;
  SUBTYPE word21 IS INTEGER RANGE -2**20 TO 2**20-1;
  SUBTYPE word16 IS INTEGER RANGE -2**15 TO 2**15-1;
  SUBTYPE word14 IS INTEGER RANGE -2**14 TO 2**14-1;
  SUBTYPE word13 IS INTEGER RANGE -2**13 TO 2**13-1;
  SUBTYPE word12 IS INTEGER RANGE -2**12 TO 2**12-1;
END n_bit_int;

LIBRARY work;
USE work.n_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;

ENTITY cic3s32 IS
  PORT ( clk   : IN  STD_LOGIC;
         x_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
         clk2  : OUT STD_LOGIC;
         y_out : OUT STD_LOGIC_VECTOR(9 DOWNTO 0));
END cic3s32;

ARCHITECTURE flex OF cic3s32 IS

  TYPE   STATE_TYPE IS (hold, sample);
  SIGNAL state : STATE_TYPE;
  SIGNAL count : INTEGER RANGE 0 TO 31;
  SIGNAL x     : STD_LOGIC_VECTOR(7 DOWNTO 0);
                                        -- Registered input
  SIGNAL sxtx  : STD_LOGIC_VECTOR(25 DOWNTO 0);
                                        -- Sign extended input
  SIGNAL i0    : word26;                -- I section 0
  SIGNAL i1    : word21;                -- I section 1
  SIGNAL i2    : word16;                -- I section 2
  SIGNAL i2d1, i2d2, i2d3, i2d4, c1, c0 : word14;
                                        -- I and COMB section 0
  SIGNAL c1d1, c1d2, c1d3, c1d4, c2 : word13;  -- COMB 1
  SIGNAL c2d1, c2d2, c2d3, c2d4, c3 : word12;  -- COMB 2

BEGIN

  FSM: PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    IF count = 31 THEN
      count <= 0;
      state <= sample;
      clk2  <= '1';
    ELSE
      count <= count + 1;
      state <= hold;
      clk2  <= '0';
    END IF;
  END PROCESS FSM;

  Sxt: PROCESS (x)
  BEGIN
    sxtx(7 DOWNTO 0) <= x;
    FOR k IN 25 DOWNTO 8 LOOP
      sxtx(k) <= x(x'high);
    END LOOP;
  END PROCESS Sxt;

  Int: PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    x  <= x_in;
    i0 <= i0 + CONV_INTEGER(sxtx);
    i1 <= i1 + i0 / 32;
    i2 <= i2 + i1 / 32;
  END PROCESS Int;

  Comb: PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    IF state = sample THEN
      c0   <= i2 / 4;
      i2d1 <= c0;
      i2d2 <= i2d1;
      c1   <= c0 - i2d2;
      c1d1 <= c1 / 2;
      c1d2 <= c1d1;
      c2   <= c1 / 2 - c1d2;
      c2d1 <= c2 / 2;
      c2d2 <= c2d1;
      c3   <= c2 / 2 - c2d2;
    END IF;
  END PROCESS Comb;

  y_out <= CONV_STD_LOGIC_VECTOR(c3 / 4, 10);

END flex;

6 The equivalent Verilog code cic3s32.v for this example can be found in 
Appendix A on page 466. 

Fig. 5.24. VHDL simulation of the three-stage CIC filter, implemented with bit 
pruning. 

The design has the same architecture as the unscaled CIC shown in Example 
5.4 (p. 191). The design consists of a finite state machine (FSM), a sign 
extension Sxt: PROCESS, and two “arithmetic” PROCESS blocks. The FSM: PROCESS 
contains the clock divider for the comb sections. The Int: PROCESS realizes 
the three integrators. The Comb: PROCESS includes the three comb sections, 
each having a delay of two. But now, all integrator and comb sections are 
designed with the bit width suggested by Hogenauer's pruning technique. 
This reduces the size of the design to 238 LCs and the design now runs at 
44.64 MHz. | 5.6 | 

This design improved the speed from 40.0 to 44.64 MHz and also saved 163 LCs, 
or 40%, compared with the design considered in Example 5.4 (p. 191). 
Comparing the filter output of the VHDL simulations, shown in Figs. 5.24 and 
5.18 (p. 194), different LSB quantization behavior can be noted (see Exercise 
5.11, p. 239). In the pruned design, “noise” possesses the asymptotic behavior 
of the LSB (507 ↔ 508). 



5.3.5 CIC RNS Design 

The design of a CIC filter using the RNS was proposed by Garcia et al. 
[45]. A three-stage CIC filter, with 8-bit input, 10-bit output, D = 2, and 
R = 32 was implemented. The maximum word width was 26 bits. For the 
RNS implementation, the 4-moduli set (256, 63, 61, 59), i.e., one 8-bit two's 
complement and three 6-bit moduli, covers this range (see Fig. 5.25). The 
output was scaled using an ε-CRT requiring eight tables and three two's 
complement adders [39, Fig. 1], or (as shown in Fig. 5.26) using a base removal 
scaling (BRS) algorithm based on two 6-bit moduli (after [38]), and an ε-CRT 
for the remaining two moduli, for a total of five modulo adders and nine 
ROM tables, or seven tables (if multiplicative inverse ROM and the ε-CRT 
are combined). The following table shows the speed in MSPS and the number 
of LEs and EABs used for the three scaling schemes. 

Fig. 5.25. CIC filter. Detail of design with base removal scaling (BRS). 



Type            ε-CRT    BRS ε-CRT               BRS ε-CRT
                         (Speed data for         combined
                         BRS m_4 only)           ROM

MSPS            58.8     70.4                    58.8
#LE             34       87                      87
#Table (EAB)    8        9                       7



The decrease in speed to 58.8 MSPS, for the scaling schemes 1 and 3, is 
the result of the need for a 10-bit ε-CRT. It should be noted that this does 
not reduce the system speed, since scaling is applied at the lower (output) 
sampling rate. For the BRS ε-CRT, it is assumed that only the BRS $m_4$ 
part (see Fig. 5.13) must run at the input sampling rate, while BRS $m_3$ and 
ε-CRT run at the output sampling rate. 

Some resources can be saved if a scaling scheme, similar to Example 5.5 
(p. 198), and illustrated in Fig. 5.25, is used. With this scheme, the BRS 
ε-CRT scheme must be applied to reduce the bit width in the earlier sections 
of the filter. The early use of ROMs decreases the possible throughput from 
76.3 to 70.4 MSPS, which is the maximum speed of the BRS with $m_4$. At the 
output, the efficient ε-CRT scheme was applied. 

Fig. 5.26. BRS and ε-CRT conversion steps. 



The following table summarizes the three implemented filter designs, with- 
out including the scaling data. 



Type      2C        RNS               Detailed bit width
          26-bit    8, 6, 6, 6-bit    RNS design

MSPS      49.3      76.3              70.4
#LEs      343       559               355



5.4 Multistage Decimator 

If the decimation rate R is large, it can be shown that a multistage design 
can be realized with less effort than a single-stage converter. In particular, 
S stages, each having a decimation capability of $R_k$, are designed to have 
an overall downsampling rate of $R = R_1 R_2 \cdots R_S$. Unfortunately, passband 
imperfections, such as ripple deviation, accumulate from stage to stage. As a 
result, a passband deviation target of $\varepsilon_p$ must normally be tightened on the 
order of $\varepsilon'_p = \varepsilon_p / S$ to meet overall system specifications. This is obviously a 
worst-case assumption, in which all short filters have the maximum ripple at 
the same frequencies, which is, in general, too pessimistic. It is often more 
reasonable to try an initial value near the given passband specification $\varepsilon_p$, 
and then selectively reduce it if necessary. 

5.4.1 Multistage Decimator Design Using Goodman-Carey 
Half-band Filters 

Goodman and Carey [68] proposed to develop multistage systems based on 
the use of CIC and half-band filters. As the name implies, a half-band filter 
has a passband and stopband located at $\omega_s = \omega_p = \pi/2$, or midway in the 
baseband. A half-band filter can therefore be used to change the sampling 
rate by a factor of two. If the half-band filter has point symmetry relative to 
$\omega = \pi/2$, then all even coefficients (except the center tap) become zero. 

Definition 5.7: Half-band Filter 

The impulse response of a half-band filter is symmetric to k = d and 
obeys the following rule 

$$f[k] = 0 \quad k = \text{even without } k = d. \tag{5.36}$$

The same condition transformed in the z-domain reads 

$$F(z) + F(-z) = c \times z^{-d}, \tag{5.37}$$

where $c \in \mathbb{C}$ is a constant and $d \in \mathbb{N}_0$. 

Goodman and Carey [68] have compiled a list of integer half-band filters that, 
with increased length, have smaller amplitude distortions. Table 5.3 shows 
the coefficients of these half-band filters. To simplify the representation, the 
coefficients were noted with a center tap located at d = 0. F1 is the 
moving-average filter of length L, i.e., it is Hogenauer's CIC filter, and may therefore 
be used in the first stage also, to change the rate with a factor other than 
two. Figure 5.27 shows the transfer function of the nine different filters. Note 
that in the logarithmic plot of Fig. 5.27, the point symmetry (as is usual for 
half-band filters) cannot be observed. 

The basic idea of the Goodman and Carey multistage decimator design is 
that, in the first stages, filters with larger ripple and less complexity can be 
applied, because the passband-to-sampling frequency ratio is relatively small. 
As the passband-to-sampling frequency ratio increases, we must use filters 
with less distortion. The algorithm stops at R = 2. For the final decimation 
(R = 2 to R = 1), a longer half-band filter must be designed. 

Table 5.3. Coefficients of the half-band filter F1 to F9 from Goodman and Carey 
[68]. 

Name   L    Ripple   f[0]   f[1]   f[3]    f[5]   f[7]   f[9]
F1     3    -        1      1
F2     3    -        2      1
F3     7    -        16     9      -1
F4     7    36 dB    32     19     -3
F5     11   -        256    150    -25     3
F6     11   49 dB    346    208    -44     9
F7     11   77 dB    512    302    -53     7
F8     15   65 dB    802    490    -116    33     -6
F9     19   78 dB    8192   5042   -1277   429    -116   18

Fig. 5.27. Transfer function of the half-band filter F1 to F9. 
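
The half-band property of the tabulated filters can be checked numerically. The following Python sketch is an illustration, not part of the Goodman and Carey design procedure: it takes the coefficients of F2 to F9 from Table 5.3 with the center tap at d = 0, evaluates the zero-phase response, and tests the frequency-domain form of (5.37), i.e., that the responses at ω and ω + π add up to the constant 2 f[0]. The dictionary and function names are hypothetical.

```python
import math

# Center tap f[0] and the odd taps f[1], f[3], f[5], ... from Table 5.3.
HALF_BAND_TAPS = {
    "F2": (2, (1,)),
    "F3": (16, (9, -1)),
    "F4": (32, (19, -3)),
    "F5": (256, (150, -25, 3)),
    "F6": (346, (208, -44, 9)),
    "F7": (512, (302, -53, 7)),
    "F8": (802, (490, -116, 33, -6)),
    "F9": (8192, (5042, -1277, 429, -116, 18)),
}


def zero_phase_response(center, odd_taps, omega):
    """A(omega) = f[0] + 2 * sum f[k] cos(k omega) over the odd taps k."""
    total = float(center)
    for i, coeff in enumerate(odd_taps):
        total += 2.0 * coeff * math.cos((2 * i + 1) * omega)
    return total


def satisfies_half_band_rule(center, odd_taps, points=16):
    """Check A(omega) + A(omega + pi) = 2 f[0] on a grid of frequencies."""
    for n in range(points):
        omega = math.pi * n / points
        pair_sum = (zero_phase_response(center, odd_taps, omega)
                    + zero_phase_response(center, odd_taps, omega + math.pi))
        if abs(pair_sum - 2 * center) > 1e-6 * center:
            return False
    return True


for name, (center, odd_taps) in HALF_BAND_TAPS.items():
    dc_gain = zero_phase_response(center, odd_taps, 0.0)
    print(name, satisfies_half_band_rule(center, odd_taps), dc_gain / center)
```

For every filter F2 to F9 the DC gain equals twice the center tap, so the response at ω = π/2 lies exactly halfway between the passband and stopband levels. F1 is left out because, as a plain moving average, it has a nonzero tap next to the center and a DC gain of 3 rather than 2.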

Goodman and Carey have provided the design chart shown in Fig. 5.28. 
Initially, the input oversampling factor R and the necessary attenuation in 
the passband and stopband $A = A_p = A_s$ must be computed. From this 
starting point, the necessary filters for R, R/2, R/4, ... can be drawn as a 
horizontal line (at the same stopband attenuation). The filters F4 and F6-F9 
have ripple in the passband (see Exercise 5.8, p. 238), and if several such 
filters are used it may be necessary to adjust $\varepsilon_p$. We may, therefore, consider 
the following adjustment 

$$A = -20 \log_{10} \varepsilon_p \quad \text{for F1-F3, F5} \tag{5.38}$$

$$A = -20 \log_{10} \min\left( \frac{\varepsilon_p}{S}, \varepsilon_s \right) \quad \text{for F4, F6-F9}, \tag{5.39}$$

where S is the number of stages with ripple. 

We will demonstrate the multistage design with the following example. 




Fig. 5.28. Goodman and Carey design chart [68]. 

Fig. 5.29. Design example for Goodman and Carey half-band filter. (a) Design 
chart. (b) Transfer function $|F(\omega)|$. 



Example 5.8: Multistage Half-band Filter Decimator 

We wish to develop a decimator with R = 160, $\varepsilon_p = 0.015$, and $\varepsilon_s = 0.031 = 
30$ dB, using the Goodman and Carey design approach. 

At first glance, we can conclude that we need a total of five filters and mark 
the starting point at R = 160 and 30 dB in Fig. 5.29a. From 160 to 32, we 
use a CIC filter of length L = 5. This CIC filter is followed by two F2 filters 
and one F3 filter to reach R = 8. Now we need a filter with ripple. It follows 
that 

$$A = -20 \log_{10} \min\left( \frac{0.015}{1}, 0.031 \right) = 36.48\ \text{dB}. \tag{5.40}$$

From Fig. 5.28, we conclude that for 36 dB the filter F4 is appropriate. We 
may now compute the whole filter transfer function $|F(\omega)|$ by using the Noble 
relation (see Fig. 5.6, p. 179) $F(z) = F1(z)\,F2(z^{5})\,F2(z^{10})\,F3(z^{20})\,F4(z^{40})$, 
which is shown in Fig. 5.29b. Figure 5.29a shows the design algorithm, using 
the design chart from Fig. 5.28. | 5.8 | 

Example 5.8 shows that considering only the filter with ripple in (5.39) 
was sufficient. Using a more pessimistic approach, with S = 6, we would have 
obtained $A = -20 \log_{10}(0.015/6) = 52$ dB, and we would have needed filter F8, 
with considerably higher effort. It is therefore better to start with an optimistic 
assumption and possibly correct this later. 
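
The two attenuation figures quoted in Example 5.8 follow directly from (5.38) and (5.39). The short Python sketch below retraces the arithmetic; the function name and its argument `ripple_stages` (the number S of stages with ripple, zero for a ripple-free filter) are hypothetical.

```python
import math


def required_attenuation_db(eps_p, eps_s, ripple_stages=0):
    """Attenuation A used to enter the Goodman and Carey design chart."""
    if ripple_stages == 0:
        return -20.0 * math.log10(eps_p)                        # (5.38)
    return -20.0 * math.log10(min(eps_p / ripple_stages, eps_s))  # (5.39)


optimistic = required_attenuation_db(0.015, 0.031, ripple_stages=1)
pessimistic = required_attenuation_db(0.015, 0.031, ripple_stages=6)
print(round(optimistic, 2), round(pessimistic, 2))  # 36.48 52.04
```

The optimistic assumption leads to the 36 dB filter F4, while the pessimistic one pushes the requirement beyond 49 dB and therefore toward F8.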



5.5 Frequency-Sampling Filters as Bandpass Decimators 

The CIC filters discussed in Sect. 5.3 (p. 187) belong to a larger class of 
systems called frequency-sampling filters (FSFs). Frequency-sampling filters 
can be used, as channelizer or decimating filter, to decompose the information 
spectrum into a set of discrete subbands, such as those found in multiuser 
communication systems. A classic FSF consists of a comb filter cascaded with 
a bank of frequency-selective resonators [4, 55]. The resonators independently 
produce a collection of poles that selectively annihilate the zeros produced 
by the comb prefilter. Gain adjustments are applied to the output of the 
resonators to shape the resulting magnitude frequency response of the overall 
filter. An FSF can also be created by cascading all-pole filter sections with 
all-zero filter (comb) sections, as suggested in Fig. 5.30. The delay of the 
comb section, $1 \pm z^{-D}$, is chosen so that its zeros cancel the poles of the 
all-pole prefilter as shown in Fig. 5.31. Wherever there is a complex pole, there 
is also an annihilating complex zero that results in an all-zero FIR, with the 
usual linear-phase and constant group-delay properties. 

Fig. 5.30. Cascading of frequency-sampling filters to save a factor of R delays for 
multirate signal processing [4, Sect. 3.4]. 

Frequency-sampling filters are of interest to designers of multirate filter 
banks due, in part, to their intrinsic low complexity and linear-phase behav- 
ior. FSF designs rely on exact pole-zero annihilation and are often found in 
embedded applications. Exact FSF pole-zero annihilation can be guaranteed 
by using polynomial filters defined over an integer ring using the two’s com- 
plement or the residue number system (RNS). The poles of an FSF filter 
developed in this manner can reside on the periphery of the unit circle. This 
conditionally unstable location is acceptable, due to the guarantee of exact 
pole-zero cancellation. Without this guarantee, the designer would have to 
locate the poles of the resonators within the unit circle, with a loss in per- 
formance. In addition, by allowing the FSF poles and zeros to reside on the 
unit circle, a multiplier-less FSF can be created, with an attendant reduction 
in complexity and an increase in data bandwidth. 

Consider the filter shown in Fig. 5.30. It can be shown that first-order 
filter sections (with integer coefficients) produce poles at angles of 0° and 180°. 
Second-order sections, with integer coefficients, can produce poles at angles 
of 60°, 90°, and 120°, according to the relationship $2\cos(2\pi K/D) = 1$, 0, and 
−1. The frequency selectivity of higher-order sections is shown in Table 5.4. 
The angular frequencies for all polynomials having integer coefficients with 
roots on the unit circle, up to order six, are reported. The building blocks 
listed in Table 5.4 can be used to efficiently design and implement such FSF 
filters. For example, a two's complement (i.e., RNS single modulus) filter 
bank was developed for use as a constant-Q speech processing filter bank. 
It covers a frequency range from 900 to 8000 Hz [91, 92], using a 16 kHz 
sampling frequency. An integer coefficient half-band filter HB6 [68] anti-aliasing 
filter and a third-order multiplier-free CIC filter (also known as Hogenauer 
filter [88], see Sect. 5.3, p. 187) were then added to the design to suppress 
unwanted frequency components, as shown in Fig. 5.32. The bandwidth of each 
resonator can be independently tuned by the number of stages and delays 
in the comb section. The number of stages and delays is optimized to meet 
the desired bandwidth requirements. All frequency-selective filters have two 
stages and delays. 

Fig. 5.31. Example of pole/zero-compensation for a pole-angle of 60° and comb 
delay D = 6. 

Table 5.4. Filters with integer coefficients producing unique angular pole locations 
up to order six. Shown are the filter coefficients and nonredundant angular locations 
of the roots on the unit circle. 

C_k(z)     Order  a_0  a_1  a_2  a_3  a_4  a_5  a_6   θ_1      θ_2       θ_3
-C_1(z)    1      1    -1                             0°
C_2(z)     1      1    1                              180°
C_6(z)     2      1    -1   1                         60°
C_4(z)     2      1    0    1                         90°
C_3(z)     2      1    1    1                         120°
C_12(z)    4      1    0    -1   0    1               30°      150°
C_10(z)    4      1    -1   1    -1   1               36°      108°
C_8(z)     4      1    0    0    0    1               45°      135°
C_5(z)     4      1    1    1    1    1               72°      144°
C_18(z)    6      1    0    0    -1   0    0    1     20.00°   100.00°   140.00°
C_14(z)    6      1    -1   1    -1   1    -1   1     25.71°   77.14°    128.57°
C_7(z)     6      1    1    1    1    1    1    1     51.42°   102.86°   154.29°
C_9(z)     6      1    0    0    1    0    0    1     40.00°   80.00°    160.00°

Fig. 5.32. Design of a filter bank consisting of a half-band and CIC prefilter and 
FSF comb-resonator sections. 
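
The pole angles listed in Table 5.4 can be verified by evaluating each polynomial on the unit circle. The Python sketch below does this for a selection of entries; it assumes that the coefficients $a_0, a_1, \ldots$ multiply increasing powers of $z^{-1}$, and the names `FSF_BLOCKS` and `residual_on_unit_circle` are hypothetical.

```python
import cmath
import math

# Coefficients a_0 ... a_n and root angles in degrees, taken from Table 5.4.
FSF_BLOCKS = {
    "C6": ([1, -1, 1], [60.0]),
    "C4": ([1, 0, 1], [90.0]),
    "C12": ([1, 0, -1, 0, 1], [30.0, 150.0]),
    "C8": ([1, 0, 0, 0, 1], [45.0, 135.0]),
    "C18": ([1, 0, 0, -1, 0, 0, 1], [20.0, 100.0, 140.0]),
    "C9": ([1, 0, 0, 1, 0, 0, 1], [40.0, 80.0, 160.0]),
}


def residual_on_unit_circle(coeffs, angle_deg):
    """|a_0 + a_1 w + ... + a_n w^n| with w = z^-1 and z = exp(j * angle)."""
    w = cmath.exp(-1j * math.radians(angle_deg))
    value = 0j
    for a in reversed(coeffs):  # Horner scheme in w
        value = value * w + a
    return abs(value)


worst = max(
    residual_on_unit_circle(coeffs, angle)
    for coeffs, angles in FSF_BLOCKS.values()
    for angle in angles
)
print(worst < 1e-9)
```

Each listed angle makes the corresponding polynomial vanish up to rounding error, which confirms that these integer-coefficient sections place their roots exactly on the unit circle.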

The filter bank was prototyped using a Xilinx XC4000 FPGA with the 
complexity reported in Table 5.5. Using high-level design tools (XBLOCKS 
from Xilinx), the number of used CLBs was typically 20% higher than the 
theoretical prediction obtained by counting adders, flip-flops, ROMs, and 
RAMs. 

The design of an FSF can be manipulated by changing the comb delay, 
channel amplitude, or the number of sections. For example, adaptation of 
the comb delay can easily be achieved because the CLBs are used as 32 × 1 
memory cells, and a counter realizes specific comb delays with the CLB used 
as a memory cell. 

Table 5.5. Number of used CLBs of Xilinx XC4000 FPGAs (Notation: F20D90 
means filter pole-angle 20.00° delay Comb D = 90). Total: Actual 1572 CLBs, 
nonrecursive FIR: 11292 CLBs 

             F20D90   F25D70   F36D60   F51D49   F72D40   F90D40
Theory       122      184      128      164      124      65
Practice     160      271      190      240      190      93
Nonre. FIR   2256     1836     1924     1140     1039     1287

             F120D33  F180D14  HB6      III      D4       D5
Theory       86       35       122      31       24       24
Practice     120      53       153      36       33       33
Nonre. FIR   1260     550



5.6 Filter Banks 

A digital filter bank is a collection of filters having a common input or 
output, as shown in Fig. 5.33. One common application of the analysis filter 
bank shown in Fig. 5.33a is spectrum analysis, i.e., to split the input signal into R 
different so-called subband signals. The combination of several signals into a 
common output signal, shown in Fig. 5.33b, is called a synthesis filter bank. 
The analysis filter may be nonoverlapping, slightly overlapping, or 
substantially overlapping. Figure 5.34 shows an example of a slightly overlapping 
filter bank, which is the most common case. 

Another important characteristic that distinguishes different classes of 
filter banks is the bandwidth and spacing of the center frequencies of the 
filters. A popular example of a nonuniform filter bank is the octave-spaced or 
wavelet filter bank, which will be discussed in Sect. 5.7 (p. 230). In uniform 
filter banks, all filters have the same bandwidth and sampling rates. From 
the implementation standpoint, uniform, maximal decimating filter banks 
are often preferred, because they can be realized with the help of an FFT 
algorithm, as shown in the next section. 



5.6.1 Uniform DFT Filter Bank 

In a maximal decimating, or critically sampled filter bank, the decimation or 
interpolation R is equal to the number of bands K. We call it a DFT filter 
bank if the r-th band filter $h_r[n]$ is computed from the “modulation” of a single 
prototype filter $h[n]$, according to 

$$h_r[n] = h[n]\, W_R^{rn} = h[n]\, e^{-j 2\pi r n / R}. \tag{5.41}$$

An efficient implementation of the R channel filter bank can be generated 
if we use polyphase decomposition (see Sect. 5.2, p. 179) of the filter $h_r[n]$ 
and the input signal $x[n]$. Because each of these bandpass filters is critically 
sampled, we use a decomposition with R polyphase signals according to 

$$h[n] = \sum_{k=0}^{R-1} h_k[n] \qquad h_k[m] = h[mR - k] \tag{5.42}$$

$$x[n] = \sum_{k=0}^{R-1} x_k[n] \qquad x_k[m] = x[mR - k]. \tag{5.43}$$

Fig. 5.33. Typical filter bank decomposition system showing (a) analysis, and (b) 
synthesis filters. 

Fig. 5.34. R channel filter bank, with a small amount of overlapping. 

Fig. 5.35. (a) Analysis DFT filter bank for channel k. (b) Complete analysis DFT 
filter bank. 

If we now substitute (5.42) into (5.41), we find that all bandpass filters $h_r[n]$ 
share the same polyphase filter $h_k[n]$, while the “twiddle factors” for each 
filter are different. This structure is shown in Fig. 5.35a for the r-th filter $h_r[n]$. 
It is now obvious that this “twiddle multiplication” for $h_r[n]$ corresponds to 
the r-th DFT component, with an input vector of $x_0[n], x_1[n], \ldots, x_{R-1}[n]$. 
The computation for the whole analysis band can be reduced to filtering 
with R polyphase filters, followed by a DFT (or FFT) of these R filtered 
components, as shown in Fig. 5.35b. This is obviously much more efficient 
than direct computation using the filter defined in (5.41) (see Exercise 5.6, 
p. 238). 
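
The equivalence between the direct modulated filter bank and the polyphase/DFT structure can be demonstrated numerically. The following Python sketch is a plain reference model and not an efficient implementation. It assumes a polyphase indexing of the form $e_p[q] = h[qR + p]$ together with $x_p[m] = x[mR - p]$, which may differ in sign convention from (5.42), and it uses an arbitrary 16-tap window as prototype filter. All names are hypothetical.

```python
import cmath
import math


def direct_bank(x, h, R):
    """Filter x with every h_r[n] = h[n] W_R^(r n), then keep every R-th sample."""
    W = cmath.exp(-2j * cmath.pi / R)
    n_out = len(x) // R
    bands = []
    for r in range(R):
        h_r = [h[n] * W ** (r * n) for n in range(len(h))]
        bands.append([
            sum(h_r[n] * x[m * R - n] for n in range(len(h)) if m * R - n >= 0)
            for m in range(n_out)
        ])
    return bands


def polyphase_bank(x, h, R):
    """R polyphase branch filters at the low rate, followed by an R-point DFT."""
    W = cmath.exp(-2j * cmath.pi / R)
    n_out = len(x) // R
    branch_out = []
    for p in range(R):
        e_p = h[p::R]  # polyphase component of the prototype filter
        x_p = [x[m * R - p] if m * R - p >= 0 else 0.0 for m in range(n_out)]
        branch_out.append([
            sum(e_p[q] * x_p[m - q] for q in range(len(e_p)) if m - q >= 0)
            for m in range(n_out)
        ])
    # The twiddle multiplication over the branches is a DFT for every output m.
    return [
        [sum(W ** (r * p) * branch_out[p][m] for p in range(R)) for m in range(n_out)]
        for r in range(R)
    ]


R = 4
h = [0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / 16) for n in range(16)]
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
direct = direct_bank(x, h, R)
fast = polyphase_bank(x, h, R)
max_error = max(abs(a - b) for da, fa in zip(direct, fast) for a, b in zip(da, fa))
print(max_error < 1e-9)
```

In the direct form every band runs a full-length complex filter at the input rate, whereas the polyphase form shares one set of short real filters running at the low rate among all bands, which is the source of the savings mentioned above.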

The polyphase filter bank for the uniform DFT synthesis bank can be 
developed as an inverse operation to the analysis bank, i.e., we can use the 
R spectral components $X_r[k]$ as input for the inverse DFT (or FFT), and 
reconstruct the output signal using a polyphase interpolator structure, shown 
in Fig. 5.36. The reconstruction bandpass filter becomes 

$$f_r[n] = \frac{1}{R} f[n]\, W_R^{-rn} = \frac{1}{R} f[n]\, e^{j 2\pi r n / R}. \tag{5.44}$$

Fig. 5.36. DFT synthesis filter bank. 

If we now combine the analysis and synthesis filter banks, we can see that 
the DFT and IDFT annihilate each other, and perfect reconstruction occurs if 
the convolution of the included polyphase filter gives a unit sample function, 
i.e., 

$$h_r[n] * f_r[n] = \begin{cases} 1 & n = d \\ 0 & \text{else.} \end{cases} \tag{5.45}$$

In other words, the two polyphase functions must be inverse filters of each 
other, i.e., 

$$H_r(z) \times F_r(z) = z^{-d} \qquad F_r(z) = \frac{z^{-d}}{H_r(z)},$$

where we allow a delay d in order to have causal (realizable) filters. In a 
practical design, these ideal conditions cannot be met exactly by two FIR 
filters. We can use approximation for the two FIR filters, or we can combine 
an FIR and IIR, as shown in the following example. 



Example 5.9: DFT Filter Bank 

The lossy integrator studied in Example 4.3 (p. 163) should be interpreted in 
the context of a DFT filter bank with R = 2. The difference equation was 

y[n + 1] = ^y[n\ + a;[n]. (5.46) 

The impulse response of this filter in the z-domain is 



n*) = 



1 - 0.75z- 



(5.47) 
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Hj(z) 




Fig. 5.37. Critically sampled uniform DFT filter bank for R = 2. (a) Analysis 
filter bank, (b) Synthesis filter bank. 



In order to get two polyphase filters, we use a similar scheme as for the
“scattered look-ahead” modification (see Example 4.5, p. 166), i.e., we introduce
an additional pole/zero pair at the mirror position. Multiplying numerator
and denominator by $(1 + 0.75 z^{-1})$ yields

$$F(z) = \frac{0.75 z^{-2}}{1 - 0.75^2 z^{-2}} + z^{-1} \frac{1}{1 - 0.75^2 z^{-2}} \tag{5.48}$$

$$F(z) = H_0(z^2) + z^{-1} H_1(z^2), \tag{5.49}$$

which gives the two polyphase filters:

$$H_0(z) = \frac{0.75 z^{-1}}{1 - 0.75^2 z^{-1}} = 0.75 z^{-1} + 0.4219 z^{-2} + 0.2373 z^{-3} + \ldots \tag{5.50}$$

$$H_1(z) = \frac{1}{1 - 0.75^2 z^{-1}} = 1 + 0.5625 z^{-1} + 0.3164 z^{-2} + \ldots \tag{5.51}$$

We can approximate these impulse responses with a nonrecursive FIR, but
to get less than 1% error we must use about 16 coefficients. It is therefore
much more efficient if we use the two recursive polyphase IIR filters defined
by (5.50) and (5.51). After decomposition with the polyphase filters, we then
apply a 2-point DFT, which is given by

$$\mathbf{W} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$

The whole analysis filter bank can now be constructed as shown in Fig. 5.37a.
For the synthesis bank, we first compute the inverse DFT using

$$\mathbf{W}^{-1} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$

In order to get a perfect reconstruction we must find the inverse polyphase
filters to $h_0[n]$ and $h_1[n]$. This is not difficult, because the $H_r(z)$ are
single-pole IIR filters, and $F_r(z) = z^{-d}/H_r(z)$ must therefore be two-tap FIR filters.
Using (5.50) and (5.51), we find that d = 1 is already sufficient to get causal
filters, and it is

$$F_0(z) = \frac{1}{0.75} \left(1 - 0.75^2 z^{-1}\right) \tag{5.52}$$

$$F_1(z) = z^{-1} - 0.75^2 z^{-2}. \tag{5.53}$$

The synthesis bank is graphically interpreted in Fig. 5.37b. | 5.9 | 
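The decomposition in this example can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it takes $F(z) = z^{-1}/(1 - 0.75z^{-1})$ as the transfer function of the lossy integrator, uses the two-tap synthesis filters obtained from $F_r(z) = z^{-1}/H_r(z)$, and the helper names `iir_impulse` and `convolve` are hypothetical, not part of the book's design files.

```python
def iir_impulse(num, den, n):
    """First n impulse-response samples of num(z^-1)/den(z^-1); den[0] must be 1."""
    h = []
    for k in range(n):
        acc = num[k] if k < len(num) else 0.0
        for i in range(1, len(den)):
            if k - i >= 0:
                acc -= den[i] * h[k - i]
        h.append(acc)
    return h


def convolve(a, b):
    """Full linear convolution of two coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


N = 12
f = iir_impulse([0.0, 1.0], [1.0, -0.75], 2 * N)      # F(z) = z^-1 / (1 - 0.75 z^-1)
h0 = iir_impulse([0.0, 0.75], [1.0, -0.75 ** 2], N)   # H0(z), Eq. (5.50)
h1 = iir_impulse([1.0], [1.0, -0.75 ** 2], N)         # H1(z), Eq. (5.51)

# Polyphase property: even samples of f[n] form h0[m], odd samples form h1[m].
polyphase_ok = all(
    abs(f[2 * m] - h0[m]) < 1e-12 and abs(f[2 * m + 1] - h1[m]) < 1e-12
    for m in range(N)
)

# Two-tap synthesis filters F_r(z) = z^-1 / H_r(z), i.e., delay d = 1 (assumed).
f0 = [1 / 0.75, -0.75]          # (1/0.75) (1 - 0.75^2 z^-1)
f1 = [0.0, 1.0, -0.75 ** 2]     # z^-1 - 0.75^2 z^-2
delta = [0.0, 1.0] + [0.0] * (N - 2)   # unit sample delayed by one
inverse_ok = all(
    abs(a - d) < 1e-12 and abs(b - d) < 1e-12
    for a, b, d in zip(convolve(h0, f0), convolve(h1, f1), delta)
)
print(polyphase_ok, inverse_ok)
```

Only the first N samples of each convolution are compared, because the truncated impulse responses are exact only over that range.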




5.6.2 Two-channel Filter Banks 

Two-channel filter banks are an important tool for the design of general filter
banks and wavelets. Figure 5.38 shows an example of a two-channel filter
bank that splits the input x[n] using lowpass (G(z)) and highpass (H(z))
“analysis” filters. The output signal $\hat{x}[n]$ is reconstructed using lowpass
($\hat{G}(z)$) and highpass ($\hat{H}(z)$) “synthesis” filters. Between the analysis and
synthesis sections are decimation-by-2 and interpolation-by-2 units. The signal
between the decimators and interpolators is often quantized, and nonlinearly
processed for enhancement, or compressed.

It is common practice to define only the lowpass filter G(z), and to use 
its definition to specify the highpass filter H(z). The construction rule is 
normally given by 

$$h[n] = (-1)^n g[n] \quad \circ\!-\!\bullet \quad H(z) = G(-z), \tag{5.54}$$

which defines the filters to be mirrored pairs. Specifically, in the frequency
domain, $|H(e^{j\omega})| = |G(e^{j(\omega - \pi)})|$. This is a quadrature mirror filter (QMF)
bank, because the two filters have mirror symmetry to $\pi/2$.

For the synthesis shown in Fig. 5.38, we first use an expander (a sampling
rate increase of 2), and then two separate reconstruction filters, $\hat{G}(z)$ and
$\hat{H}(z)$, to reconstruct $\hat{x}[n]$. A challenging question now is, can the input signal
be perfectly reconstructed, i.e., can we satisfy

$$\hat{x}[n] = x[n-d]? \tag{5.55}$$

That is, a perfectly reconstructed signal has the same shape as the original, 
up to a phase (time) shift. Because G(z) and H(z) are not ideal rectan- 
gular filters, achieving perfect reconstruction is not a trivial problem. Both 
filters produce essential aliasing components after the downsampling by 2, as 



Fig. 5.38. Two-channel filter bank using Daubechies filter of length-4. 



shown in Fig. 5.38. The simple orthogonal filter bank that satisfies (5.55) is 
attributed to Alfred Haar (circa 1910) [93]. 

Example 5.10: Two-Channel Haar Filter Bank I 

The filter transfer functions of the two-channel QMF filter bank from Fig. 5.39
are^8

$$G(z) = 1 + z^{-1} \qquad H(z) = 1 - z^{-1}$$

$$\hat{G}(z) = \frac{1}{2}\left(1 + z^{-1}\right) \qquad \hat{H}(z) = \frac{1}{2}\left(-1 + z^{-1}\right).$$

Using data found in the table in Fig. 5.39, it can be verified that the
filter produces a perfect reconstruction of the input. The input sequence
x[0], x[1], x[2], ..., processed by G(z) and H(z), yields the sum x[n] + x[n-1]
and difference x[n] - x[n-1], respectively. The downsampling followed by
upsampling forces every second value to zero. After applying the synthesis
filters and combining the outputs we again get the input sequence delayed by
one, i.e., $\hat{x}[n] = x[n-1]$, a perfect reconstruction with d = 1. | 5.10 |
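The signal flow of this example can be traced with a few lines of Python. This is a minimal sketch, assuming zero initial conditions, that the decimator keeps the even-indexed samples, and using hypothetical helper names (`fir`, `down_up`, `haar_bank`); the test sequence is arbitrary.

```python
def fir(h, x):
    """Causal FIR filter with zero initial state; output has the length of x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def down_up(x):
    """Decimation by 2 followed by expansion by 2: odd-indexed samples become zero."""
    return [v if n % 2 == 0 else 0 for n, v in enumerate(x)]


def haar_bank(x):
    g, h = [1, 1], [1, -1]                    # analysis: sum and difference
    g_hat, h_hat = [0.5, 0.5], [-0.5, 0.5]    # synthesis filters
    low = fir(g_hat, down_up(fir(g, x)))
    high = fir(h_hat, down_up(fir(h, x)))
    return [a + b for a, b in zip(low, high)]


x = [3, 1, -4, 1, 5, -9, 2, 6]
y = haar_bank(x)
print(y)   # the input delayed by one sample (d = 1)
```

All intermediate values are multiples of 0.5, so the comparison with the delayed input is exact in floating point.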



In the following we will discuss the general relationships the four filters 
must obey to get a perfect reconstruction. It is useful to remember that 

8 Sometimes the amplitude factors are chosen in such a way that orthonormal
filters are obtained, i.e., $\sum_n |h[n]|^2 = 1$. In this case, the filters have an amplitude
factor of $1/\sqrt{2}$. This will complicate a hardware design significantly.

Fig. 5.39. Two-channel Haar-QMF bank (©1999 Springer Press [5]).



decimation and interpolation by 2 of a signal s[k] is equivalent to multiplying
s[k] by the sequence {1, 0, 1, 0, ...}. This translates, in the z-domain, to

$$S_{\downarrow\uparrow}(z) = \frac{1}{2}\left(S(z) + S(-z)\right). \tag{5.56}$$

If this signal is applied to the two-channel filter bank, the lowpass path
$X_{\downarrow\uparrow G}(z)$ and highpass path $X_{\downarrow\uparrow H}(z)$ become

$$X_{\downarrow\uparrow G}(z) = \frac{1}{2}\left(X(z)G(z) + X(-z)G(-z)\right), \tag{5.57}$$

$$X_{\downarrow\uparrow H}(z) = \frac{1}{2}\left(X(z)H(z) + X(-z)H(-z)\right). \tag{5.58}$$

After multiplication by the synthesis filters $\hat{G}(z)$ and $\hat{H}(z)$, and summation
of the results, we get $\hat{X}(z)$ as

$$\hat{X}(z) = X_{\downarrow\uparrow G}(z)\hat{G}(z) + X_{\downarrow\uparrow H}(z)\hat{H}(z)$$

$$\hat{X}(z) = \frac{1}{2}\left(G(z)\hat{G}(z) + H(z)\hat{H}(z)\right)X(z) + \frac{1}{2}\left(G(-z)\hat{G}(z) + H(-z)\hat{H}(z)\right)X(-z). \tag{5.59}$$

The factor of X(—z) shows the aliasing component, while the term at X{z) 
shows the amplitude distortion. For a perfect reconstruction this translates 
into the following: 

Theorem 5.11: Perfect Reconstruction 

A perfect reconstruction for a two-channel filter bank, as shown in
Fig. 5.38, is achieved if

1) $G(-z)\hat{G}(z) + H(-z)\hat{H}(z) = 0$, i.e., the reconstruction is free of aliasing.

2) $G(z)\hat{G}(z) + H(z)\hat{H}(z) = 2z^{-d}$, i.e., the amplitude distortion has amplitude one.

Let us check this condition for the Haar filter bank. 

Example 5.12: Two-Channel Haar Filter bank II 

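The filters of the two-channel Haar QMF bank were defined by

$$G(z) = 1 + z^{-1} \qquad H(z) = 1 - z^{-1}$$

$$\hat{G}(z) = \frac{1}{2}\left(1 + z^{-1}\right) \qquad \hat{H}(z) = \frac{1}{2}\left(-1 + z^{-1}\right).$$

The two conditions from Theorem 5.11 can be proved with:

1) $G(-z)\hat{G}(z) + H(-z)\hat{H}(z)$
$= \frac{1}{2}(1 - z^{-1})(1 + z^{-1}) + \frac{1}{2}(1 + z^{-1})(-1 + z^{-1})$
$= \frac{1}{2}(1 - z^{-2}) + \frac{1}{2}(-1 + z^{-2}) = 0$

2) $G(z)\hat{G}(z) + H(z)\hat{H}(z)$
$= \frac{1}{2}(1 + z^{-1})^2 + \frac{1}{2}(1 - z^{-1})(-1 + z^{-1})$
$= \frac{1}{2}\left((1 + 2z^{-1} + z^{-2}) + (-1 + 2z^{-1} - z^{-2})\right) = 2z^{-1}$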



For the proof using Theorem 5.11, it can be noted that the perfect recon- 
struction condition does not change if we switch the analysis and synthesis 
filters. 

In the following we will discuss some restrictions that can be made in the 
filter design to fulfill the condition from Theorem 5.11 more easily. 

First, we limit the filter choice by using the following: 

Theorem 5.13: Aliasing- Free Two-Channel Filter Bank 

A two-channel filter bank is aliasing-free if

$$G(-z) = -\hat{H}(z) \quad \text{and} \quad H(-z) = \hat{G}(z). \tag{5.60}$$







This can be checked if we use (5.60) for the first condition of Theorem 5.11.
Using a length-4 filter, these two conditions can be interpreted as follows:

$$g[n] = \{g[0], g[1], g[2], g[3]\} \;\to\; \hat{h}[n] = \{-g[0], g[1], -g[2], g[3]\}$$

$$h[n] = \{h[0], h[1], h[2], h[3]\} \;\to\; \hat{g}[n] = \{h[0], -h[1], h[2], -h[3]\}.$$

With the restriction of the filters as in Theorem 5.13, we can now simplify
the second condition in Theorem 5.11. It is useful to define first an auxiliary
product filter $F(z) = G(z)\hat{G}(z)$. Substituting $H(z) = \hat{G}(-z)$ and
$\hat{H}(z) = -G(-z)$ from (5.60), the second condition from Theorem 5.11
becomes

$$G(z)\hat{G}(z) + H(z)\hat{H}(z) = F(z) - G(-z)\hat{G}(-z) = F(z) - F(-z) \tag{5.61}$$

and we finally get

$$F(z) - F(-z) = 2z^{-d}, \tag{5.62}$$

i.e., all odd-indexed coefficients of F(z) vanish except the one at the (odd)
delay d, which equals one. In other words, the product filter must be a
half-band filter.^9 The construction of a perfect
reconstruction filter bank uses the following three simple steps:

Algorithm 5.14: Perfect-Reconstruction Two-Channel Filter 

Bank 

1) Define a normalized half-band filter according to (5.62).

2) Factor the filter F(z) into $F(z) = G(z)\hat{G}(z)$.

3) Compute H(z) and $\hat{H}(z)$ using (5.60), i.e., $\hat{H}(z) = -G(-z)$ and
$H(z) = \hat{G}(-z)$.

We wish to demonstrate Algorithm 5.14 with the following example. To simplify
the notation we will, in the following example, write a combination of
a length L filter for G(z), and length N for $\hat{G}(z)$, as an L/N filter.



Example 5.15: Perfect-Reconstructing Filter Bank Using F3 

The (normalized) half-band filter F3 (Table 5.3, p. 204) of length 7 has the
following z-domain transfer function

$$F3(z) = \frac{1}{16}\left(-1 + 9z^{-2} + 16z^{-3} + 9z^{-4} - z^{-6}\right). \tag{5.63}$$

The zeros of the transfer function are at $z_{01\text{-}04} = -1$, $z_{05} = 2 + \sqrt{3} = 3.7321$,
and $z_{06} = 2 - \sqrt{3} = 0.2679 = 1/z_{05}$. There are different choices for factoring
$F(z) = G(z)\hat{G}(z)$. A 5/3 filter is, for instance,

a) $G(z) = (-1 + 2z^{-1} + 6z^{-2} + 2z^{-3} - z^{-4})/8$ and $\hat{G}(z) = (1 + 2z^{-1} + z^{-2})/2$.

We may design a 4/4 filter as:

b) $G(z) = \frac{1}{4}(1 + z^{-1})^3$ and $\hat{G}(z) = \frac{1}{4}(-1 + 3z^{-1} + 3z^{-2} - z^{-3})$.

Another variant of the 4/4 configuration uses the Daubechies filter
configuration, which is often found in wavelet applications and has the form:

c) $G(z) = \frac{1-\sqrt{3}}{4\sqrt{2}}(1 + z^{-1})^2(-z_{05} + z^{-1})$ and $\hat{G}(z) = \frac{1+\sqrt{3}}{4\sqrt{2}}(1 + z^{-1})^2(-z_{06} + z^{-1})$.

Figure 5.40 shows these three combinations, along with their pole/zero plots.
| 5.15 |
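The three factorizations can be cross-checked by multiplying the polynomials out. The sketch below is an illustration only: it assumes the highpass filters are derived as $H(z) = \hat{G}(-z)$ and $\hat{H}(z) = -G(-z)$, writes the Daubechies pair in expanded form with $\hat{G}$ taken as the time-reversed $G$, and uses hypothetical helper names (`pmul`, `modulate`, `theorem_5_11`, `close`).

```python
from math import sqrt


def pmul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(a):
    """Coefficients of A(-z): negate the odd powers of z^-1."""
    return [c if k % 2 == 0 else -c for k, c in enumerate(a)]


def theorem_5_11(g, g_hat):
    """Aliasing and distortion polynomials for H(z) = G_hat(-z), H_hat(z) = -G(-z)."""
    h = modulate(g_hat)
    h_hat = [-c for c in modulate(g)]
    alias = [p + q for p, q in zip(pmul(modulate(g), g_hat), pmul(modulate(h), h_hat))]
    dist = [p + q for p, q in zip(pmul(g, g_hat), pmul(h, h_hat))]
    return alias, dist


def close(a, b, tol=1e-12):
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))


F3 = [c / 16 for c in (-1, 0, 9, 16, 9, 0, -1)]
r3, norm = sqrt(3), 4 * sqrt(2)
db4 = [(1 + r3) / norm, (3 + r3) / norm, (3 - r3) / norm, (1 - r3) / norm]
pairs = {
    "5/3 linear-phase": ([c / 8 for c in (-1, 2, 6, 2, -1)], [c / 2 for c in (1, 2, 1)]),
    "4/4 linear-phase": ([c / 4 for c in (1, 3, 3, 1)], [c / 4 for c in (-1, 3, 3, -1)]),
    "4/4 Daubechies": (db4, db4[::-1]),
}

results = {}
for name, (g, g_hat) in pairs.items():
    alias, dist = theorem_5_11(g, g_hat)
    results[name] = (
        close(pmul(g, g_hat), F3),             # product filter equals F3
        close(alias, [0.0] * 7),               # no aliasing term
        close(dist, [0, 0, 0, 2, 0, 0, 0]),    # distortion term 2 z^-3
    )
    print(name, results[name])
```

Each entry reports whether the product equals F3, whether the aliasing term vanishes, and whether the distortion term is a pure delay $2z^{-3}$.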



9 For the definition of a half-band filter, see p. 204.

Fig. 5.40. Pole/zero plot for different factorizations of the half-band filter F3. Upper
row G(z), lower row $\hat{G}(z)$. (a) Linear-phase 5/3 filter. (b) Linear-phase 4/4 filter.
(c) 4/4 Daubechies filter.



For the Daubechies filter, the condition $H(z) = -z^{-N} G(-z^{-1})$ holds in
addition, i.e., highpass and lowpass polynomials are mirror versions of each
other. This is a typical behavior in orthogonal filter banks.

From the pole/zero plots shown in Fig. 5.40, for $F(z) = G(z)\hat{G}(z)$ the
following conclusions can be made:

Corollary 5.16: Factorization of a Half-band Filter

1) To construct a real filter, we must always group the conjugate symmetric
zeros at ($z_0$ and $z_0^*$) in the same filter.

2) For linear-phase filters, the pole/zero plot must be symmetrical to the
unit circle ($|z| = 1$). Zero pairs at ($z_0$ and $1/z_0$) must be assigned to
the same filter.

3) To have orthogonal filters that are mirror polynomials of each other
($F(z) = U(z)U(z^{-1})$), all pairs $z_0$ and $1/z_0$ must be assigned to
different filters.
Fig. 5.41. Polyphase implementation of the two-channel filter bank.



We note that some of the above conditions cannot be fulfilled at the same
time. In particular, rules 2 and 3 contradict each other. Orthogonal,
linear-phase filters are, in general, not possible, except when all zeros are
on the unit circle, as in the case of the Haar filter bank.

If we classify the filter banks from Example 5.15, we find that configura- 
tions (a) and (b) are real linear-phase filters, while (c) is a real orthogonal 
filter. 

Implementing Two-Channel Filter Banks 

We will now discuss different options for implementing two-channel filter 
banks. We will first discuss the general case, and then special simplifications 
that are possible if the filters are QMF, linear-phase, or orthogonal. We will 
only discuss the analysis filter bank, as synthesis may be achieved with graph 
transposition. 

Polyphase two-channel filter banks. In the general case, with two filters
G(z) and H(z), we can realize each filter as a polyphase filter

$$H(z) = H_0(z^2) + z^{-1}H_1(z^2) \qquad G(z) = G_0(z^2) + z^{-1}G_1(z^2), \tag{5.64}$$

which is shown in Fig. 5.41. This does not reduce the hardware effort (2L
multipliers and 2(L - 1) adders are still used), but the design can be run with
twice the usual sampling frequency, $2f_s$.
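The equivalence between filtering at the input rate followed by decimation, and polyphase filtering at the low rate, can be illustrated as follows. This is a sketch only: the helper names (`fir`, `polyphase_decimate`) and the test data are hypothetical, zero initial conditions are assumed, and the decimator is assumed to keep the even-indexed output samples.

```python
def fir(h, x):
    """Causal FIR filter with zero initial state; output has the length of x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def polyphase_decimate(h, x):
    """Compute y[m] = sum_k h[k] x[2m - k] using two half-length filters."""
    h_even, h_odd = h[0::2], h[1::2]            # H0(z) and H1(z) of Eq. (5.64)
    x_even = x[0::2]                            # x[2m]
    x_odd = ([0] + x[1::2])[:len(x_even)]       # x[2m - 1], the z^-1 branch
    y_even = fir(h_even, x_even)
    y_odd = fir(h_odd, x_odd)
    return [a + b for a, b in zip(y_even, y_odd)]


g = [1, 3, 3, 1]        # integer test filters keep the comparison exact
h = [-1, 3, 3, -1]
x = [1, 2, 3, 4, 5, 6, -2, 7, 0, 3]

low_direct = fir(g, x)[0::2]
high_direct = fir(h, x)[0::2]
low_poly = polyphase_decimate(g, x)
high_poly = polyphase_decimate(h, x)
print(low_poly == low_direct, high_poly == high_direct)
```

Each polyphase branch processes only every second sample, which is why the input may arrive at twice the rate the filters themselves are clocked at.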

These four polyphase filters have only half the length of the original filters.
We may implement these length L/2 filters directly or with one of the
following methods:

1) Run-length filter using short Winograd convolution algorithms [86], discussed
in Sect. 5.2.2, p. 184.

2) Fast convolution using FFT (discussed in Chap. 6) or NTTs (discussed
in Chap. 7).

3) Using advanced arithmetic concepts discussed in Chap. 3, such as distributed
arithmetic, reduced adder graph, or residue number system.

Fig. 5.42. Two-channel filter bank with polyphase decomposition and fast convolution
using the FFT (©1999 Springer Press [5]).



Using the fast convolution FFT/NTT techniques has the additional benefit 
that the forward transform for each polyphase filter need only be done once, 
and also, the inverse transform can be applied to the spectral sum of the two 
components, as shown in Fig. 5.42. But, in general, FFT methods only give 
improvements for longer filters, typically, larger than 32; however, the typical 
two-channel filter length is less than 32. 

Lifting. Another general approach to constructing fast and efficient two-channel
filter banks is the lifting scheme introduced recently by Sweldens [94]
and Herley and Vetterli [95]. The basic idea is the use of cross-terms (called
lifting and dual-lifting), as in a lattice filter, to construct a longer filter from a
short filter, while preserving the perfect reconstruction conditions. The basic
structure is shown in Fig. 5.43.

Designing a lifting scheme typically starts with the “lazy filter bank,”
with $G(z) = \hat{H}(z) = 1$ and $H(z) = \hat{G}(z) = z^{-1}$. This two-channel bank fulfills
both conditions from Theorem 5.11 (p. 218), i.e., it is a perfect reconstruction
filter bank. The following question arises: if we keep one filter fixed, what are
filters S(z) and T(z) such that the filter bank still provides perfect reconstruction?




Fig. 5.43. Two-channel filter implementation using lifting and dual- lifting steps. 
The answer is important, and not trivial:

Lifting: $G'(z) = G(z) + \hat{G}(-z)S(z^2)$ for any $S(z^2)$. (5.65)

Dual-Lifting: $\hat{G}'(z) = \hat{G}(z) + G(-z)T(z^2)$ for any $T(z^2)$. (5.66)

To check, we substitute the lifting equation into the perfect reconstruction
conditions from Theorem 5.11 (p. 218), and we see that both conditions are
fulfilled if the filters still meet the conditions of Theorem 5.13 (p. 218)
for the aliasing-free filter bank (Exercise 5.9, p. 238).
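A small numerical experiment makes the lifting property plausible. The sketch below starts from the lazy filter bank and applies one lifting step with an arbitrary, hypothetical $S(z^2) = 0.5 - 0.25z^{-2}$. It assumes that the synthesis highpass follows the lifted lowpass through the alias-cancelling pairing $\hat{H}(z) = G(-z)$ that the lazy bank already satisfies; the helper names are illustrative only.

```python
def pmul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def padd(a, b):
    """Add two polynomials of possibly different length."""
    n = max(len(a), len(b))
    return [(a[k] if k < len(a) else 0.0) + (b[k] if k < len(b) else 0.0)
            for k in range(n)]


def modulate(a):
    """Coefficients of A(-z): negate the odd powers of z^-1."""
    return [c if k % 2 == 0 else -c for k, c in enumerate(a)]


def pr_conditions(g, g_hat, h, h_hat):
    """Aliasing and distortion polynomials of Theorem 5.11."""
    alias = padd(pmul(modulate(g), g_hat), pmul(modulate(h), h_hat))
    dist = padd(pmul(g, g_hat), pmul(h, h_hat))
    return alias, dist


# Lazy filter bank: G = H_hat = 1 and H = G_hat = z^-1.
g, g_hat = [1.0], [0.0, 1.0]
h, h_hat = [0.0, 1.0], [1.0]
alias_lazy, dist_lazy = pr_conditions(g, g_hat, h, h_hat)

# One lifting step, Eq. (5.65), with a hypothetical S(z^2) = 0.5 - 0.25 z^-2.
s_up = [0.5, 0.0, -0.25]
g_new = padd(g, pmul(modulate(g_hat), s_up))   # G'(z) = G(z) + G_hat(-z) S(z^2)
h_hat_new = modulate(g_new)                    # H_hat'(z) = G'(-z), assumed pairing
alias_new, dist_new = pr_conditions(g_new, g_hat, h, h_hat_new)
print(alias_new, dist_new)
```

The filters $H(z)$ and $\hat{G}(z)$ stay fixed, and the distortion term remains $2z^{-1}$ for this choice of $S(z^2)$.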

The conversion of the Daubechies length-4 filter bank into lifting steps 
demonstrates the design. 



Example 5.17: Lifting Implementation of the DB4 Filter

One filter configuration in Example 5.15 (p. 219) was the Daubechies length-4
filter [96, p. 195]. The filter coefficients were

$$G(z) = \left((1+\sqrt{3}) + (3+\sqrt{3})z^{-1} + (3-\sqrt{3})z^{-2} + (1-\sqrt{3})z^{-3}\right)\frac{1}{4\sqrt{2}}$$

$$H(z) = \left(-(1-\sqrt{3}) + (3-\sqrt{3})z^{-1} - (3+\sqrt{3})z^{-2} + (1+\sqrt{3})z^{-3}\right)\frac{1}{4\sqrt{2}}.$$

A possible implementation uses two lifting steps and one dual-lifting step.
The difference equations that produce a two-channel filter bank based on
the above equation are

$$h_1[n] = x[2n+1] - \sqrt{3}\,x[2n]$$

$$g_1[n] = x[2n] + \frac{\sqrt{3}}{4}h_1[n] + \frac{\sqrt{3}-2}{4}h_1[n-1]$$

$$h_2[n] = h_1[n] + g_1[n+1]$$

$$g[n] = \frac{\sqrt{3}+1}{\sqrt{2}}\,g_1[n]$$

$$h[n] = \frac{\sqrt{3}-1}{\sqrt{2}}\,h_2[n].$$

Note that the early decimation and splitting of the input into even x[2n] and
odd x[2n - 1] sequences allows the filter to run with $2f_s$. This structure can be
directly translated into hardware and can be implemented using MaxPlusII
(Exercise 5.10, p. 239). The implementation will use five multiplications and
four adders. The reconstruction filter bank can be constructed based on graph
transposition, which is, in the case of the difference equations, a reversing
of the operations and flipping of the signs. | 5.17 |



Daubechies and Sweldens [97] have shown that any (bi)orthogonal wavelet
filter bank can be converted into a sequence of lifting and dual-lifting
steps. The number of multipliers and adders required then depends on the
number of lifting steps (more steps give lower complexity), and the savings
can reach up to 50% compared with the direct polyphase implementation. This
approach seems especially promising if the bit width of the multiplier is small [98]. On
the other hand, the lattice-like structure does not allow use of reduced adder
graph (RAG) techniques, and for longer filters the direct polyphase approach
will often be more efficient.

Fig. 5.44. Polyphase realization of the two-channel QMF bank (©1999 Springer
Press [5]).

Although the techniques (polyphase decomposition and lifting) discussed 
so far improve speed or size and cover all types of two-channel filters, addi- 
tional savings can be achieved if the filters are QMF, linear-phase, or orthog- 
onal. This will be discussed in the following. 

QMF implementation. For QMF [99] we have found that according to
(5.54),

$$h[n] = (-1)^n g[n] \quad \circ\!-\!\bullet \quad H(z) = G(-z). \tag{5.67}$$

But this implies that the polyphase filters are the same (except the sign),
i.e.,

$$G_0(z) = H_0(z) \qquad G_1(z) = -H_1(z). \tag{5.68}$$

Instead of the four filters from Fig. 5.41, for QMF we only need two filters
and an additional “butterfly,” as shown in Fig. 5.44. This saves about 50%.
For the QMF filter we need:

$$L \text{ real adders} \qquad L \text{ real multipliers}, \tag{5.69}$$

and the filter can run with twice the usual input-sampling rate.
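The saving can be seen in a short behavioral model: the two polyphase branch outputs are computed once and then combined by a sum and a difference. The following sketch assumes zero initial conditions and a decimator that keeps even-indexed samples; `fir`, `qmf_analysis`, and the test data are hypothetical names and values chosen for illustration.

```python
def fir(h, x):
    """Causal FIR filter with zero initial state; output has the length of x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def qmf_analysis(g, x):
    """Lowpass and highpass outputs from two polyphase filters plus a butterfly."""
    g0, g1 = g[0::2], g[1::2]                 # G0(z) = H0(z), G1(z) = -H1(z)
    x_even = x[0::2]                          # x[2m]
    x_odd = ([0] + x[1::2])[:len(x_even)]     # x[2m - 1]
    u = fir(g0, x_even)
    v = fir(g1, x_odd)
    low = [a + b for a, b in zip(u, v)]       # butterfly: sum
    high = [a - b for a, b in zip(u, v)]      # butterfly: difference
    return low, high


g = [1, 3, 3, 1]
h = [(-1) ** n * c for n, c in enumerate(g)]  # h[n] = (-1)^n g[n], Eq. (5.67)
x = [1, 2, 3, 4, 5, 6, -2, 7, 0, 3]

low, high = qmf_analysis(g, x)
low_ref = fir(g, x)[0::2]
high_ref = fir(h, x)[0::2]
print(low == low_ref, high == high_ref)
```

Only the L taps of g are multiplied; the highpass output costs one extra subtraction per output sample.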

Orthogonal filter banks. An orthogonal filter pair^10 obeys the conjugate
mirror filter (CQF) [100] condition, defined by

$$H(z) = z^{-N} G(-z^{-1}). \tag{5.70}$$

If we use the transposed FIR filter shown in Fig. 5.45, we need only half
the number of multipliers. The disadvantage is that we cannot benefit from
polyphase decomposition to double the speed.

Another alternative is realization of the CQF bank using the lattice filter
shown in Fig. 5.46. The following example demonstrates the conversion of
the direct FIR filter into a lattice filter.

10 The orthogonal filter name comes from the fact that the scalar product of the
filters, for a shift by two (i.e., $\sum_k g[k]h[k - 2l] = 0$, $k, l \in \mathbb{Z}$), is zero.

Fig. 5.45. Orthogonal two-channel filter bank using the transposed FIR structure.



Example 5.18: Lattice Daubechies L = 4 Filter Implementation

One filter configuration in Example 5.15 (p. 219) was the Daubechies length-4
filter [96, p. 195]. The filter coefficients were

$$G(z) = \frac{(1+\sqrt{3}) + (3+\sqrt{3})z^{-1} + (3-\sqrt{3})z^{-2} + (1-\sqrt{3})z^{-3}}{4\sqrt{2}}$$

$$G(z) = 0.4830 + 0.8365z^{-1} + 0.2241z^{-2} - 0.1294z^{-3} \tag{5.71}$$

$$H(z) = \frac{-(1-\sqrt{3}) + (3-\sqrt{3})z^{-1} - (3+\sqrt{3})z^{-2} + (1+\sqrt{3})z^{-3}}{4\sqrt{2}}$$

$$H(z) = 0.1294 + 0.2241z^{-1} - 0.8365z^{-2} + 0.4830z^{-3}. \tag{5.72}$$

The transfer function for a two-channel lattice with two stages is

$$G(z) = \left(1 + a[0]z^{-1} - a[0]a[1]z^{-2} + a[1]z^{-3}\right)s \tag{5.73}$$

$$H(z) = \left(-a[1] - a[0]a[1]z^{-1} - a[0]z^{-2} + z^{-3}\right)s. \tag{5.74}$$

If we now compare (5.71) with (5.73) we find

$$s = \frac{1+\sqrt{3}}{4\sqrt{2}} \qquad a[0] = \frac{3+\sqrt{3}}{4\sqrt{2}\,s} \qquad a[1] = \frac{1-\sqrt{3}}{4\sqrt{2}\,s}. \tag{5.75}$$

We can now translate this structure directly into hardware and implement the
filter bank with MaxPlusII as shown in the following VHDL^11 code.

Fig. 5.46. Lattice realization for the orthogonal two-channel filter bank (©1999
Springer Press [5]).

11 The equivalent Verilog code db4latti.v for this example can be found in
Appendix A on page 470.
PACKAGE n_bits_int IS                 -- User-defined types
  SUBTYPE BITS8 IS INTEGER RANGE -128 TO 127;
  SUBTYPE BITS9 IS INTEGER RANGE -2**8 TO 2**8-1;
  SUBTYPE BITS17 IS INTEGER RANGE -2**16 TO 2**16-1;
  TYPE ARRAY_BITS17_4 IS ARRAY (0 TO 3) OF BITS17;
END n_bits_int;

LIBRARY work;
USE work.n_bits_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;

ENTITY db4latti IS                        ------> Interface
  PORT (clk      : IN  STD_LOGIC;
        clk2     : OUT STD_LOGIC;
        x_in     : IN  BITS8;
        x_e, x_o : OUT BITS17;
        g, h     : OUT BITS9);
END db4latti;

ARCHITECTURE flex OF db4latti IS

  TYPE STATE_TYPE IS (even, odd);
  SIGNAL state                   : STATE_TYPE;
  SIGNAL sx_up, sx_low, x_wait   : BITS17;
  SIGNAL clk_div2                : STD_LOGIC;
  SIGNAL sxa0_up, sxa0_low       : BITS17;
  SIGNAL up0, up1, low0, low1    : BITS17;

BEGIN

  Multiplex: PROCESS     ----> Split into even and odd
  BEGIN                  ----> samples at clk rate
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN even =>
        -- Multiply with 256*s=124
        sx_up    <= 4 * (32 * x_in - x_in);
        sx_low   <= 4 * (32 * x_wait - x_wait);
        clk_div2 <= '1';
        state    <= odd;
      WHEN odd =>
        x_wait   <= x_in;
        clk_div2 <= '0';
        state    <= even;
    END CASE;
  END PROCESS;

  -- Multiply a[0] = 1.7321
  sxa0_up  <= (2*sx_up - sx_up/4)
                         - (sx_up/64 + sx_up/256);
  sxa0_low <= (2*sx_low - sx_low/4)
                         - (sx_low/64 + sx_low/256);

  -- First stage -- FF in lower tree
  up0 <= sxa0_low + sx_up;

  LowerTreeFF: PROCESS
  BEGIN
    WAIT UNTIL clk_div2 = '0';
    low0 <= sx_low - sxa0_up;
  END PROCESS;

  -- Second stage a[1]=0.2679
  up1  <= (up0 - low0/4) - (low0/64 + low0/256);
  low1 <= (low0 + up0/4) + (up0/64 + up0/256);

  x_e  <= sx_up;      -- Provide some extra test signals
  x_o  <= sx_low;
  clk2 <= clk_div2;

  OutputScale: PROCESS
  BEGIN
    WAIT UNTIL clk_div2 = '0';
    g <= up1 / 256;
    h <= low1 / 256;
  END PROCESS;

END flex;

Fig. 5.47. VHDL simulation of the Daubechies length-4 lattice filter bank.

This VHDL code is a direct translation of the lattice shown in Fig. 5.46. The
incoming stream is multiplied by $s = 0.48 \approx 124/256$. Next, the cross-term
product multiplications with $a[0] = 1.73 \approx (2 - 2^{-2} - 2^{-6} - 2^{-8})$ of the first
stage are computed. Then follow the stage-one additions, for which the lower
tree signal must be delayed by one sample. In the second stage, the cross
multiplication with $a[1] = 0.27 \approx (2^{-2} + 2^{-6} + 2^{-8})$ and the final output
addition are implemented. The design uses 331 LCs and runs at 45.24 MHz.
The VHDL simulation is shown in Fig. 5.47. The simulation shows the response
to an impulse with amplitude 100 at even and odd positions for the
filters G(z) and H(z), respectively. | 5.18 |

Fig. 5.48. Lattice filter to realize linear-phase two-channel filter bank (©1999
Springer Press [5]).
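The lattice parameters and their shift-and-add approximations can be checked with a short calculation. The sketch below is only an illustration: it evaluates s, a[0], and a[1] from the Daubechies length-4 coefficients, expands the two-stage lattice transfer functions, and compares the exact values with the power-of-two sums used in the listing. The variable names are hypothetical.

```python
from math import sqrt

r3 = sqrt(3)
norm = 4 * sqrt(2)

# Lattice parameters obtained by comparing the direct form with the lattice form.
s = (1 + r3) / norm
a0 = (3 + r3) / (norm * s)
a1 = (1 - r3) / (norm * s)

# Expanded two-stage lattice transfer functions (coefficients of z^0 ... z^-3).
g = [s * c for c in (1, a0, -a0 * a1, a1)]
h = [s * c for c in (-a1, -a0 * a1, -a0, 1)]

# Shift-and-add approximations used in the VHDL code.
s_q = 124 / 256
a0_q = 2 - 2 ** -2 - 2 ** -6 - 2 ** -8
a1_q = 2 ** -2 + 2 ** -6 + 2 ** -8      # magnitude of a[1]; the sign is in the adders

print([round(c, 4) for c in g])
print([round(c, 4) for c in h])
print(round(s_q - s, 5), round(a0_q - a0, 5), round(a1_q - abs(a1), 5))
```

All three approximation errors stay below 0.002, consistent with the eight fractional bits used for the constants.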



If we compare the size of the lattice with the direct polyphase implemen- 
tation of G(z) shown in Example 5.1 on p. 180 (LCs multiplied by two), we 
note that both designs have about the same size (208 x 2 = 416 LCs, versus 
331 LCs). Although the lattice implementation needs only five multipliers, 
compared with eight multipliers for the polyphase implementation, we note 
that in the polyphase implementation we can use the RAG technique to im- 
plement the coefficients of the transposed filter, while in the lattice we must 
implement single multipliers, which, in general, are less efficient. 

Linear-phase two-channel filter bank. We have already seen in Chap. 3
that if a linear filter has even or odd symmetry, 50% of multiplier resources
can be saved. The same symmetry also applies for polyphase decomposition
of the filters if the filters have even length. In addition, these filters may run
at twice the speed.

If G(z) and H(z) have the same length, another implementation using 
lattice filters can further decrease the implementation effort, as shown in 
Fig. 5.48. Notice that the lattice is different from the lattice used for the 
orthogonal filter bank shown in Fig. 5.46. 

The following example demonstrates how to convert a direct architecture 
into a lattice filter. 

Example 5.19: Lattice for L = 4 Linear-Phase Filter

One filter configuration in Example 5.15 (p. 219) was a linear-phase filter
pair, with both filters of length 4. The filters are

$$G(z) = \frac{1}{4}\left(1 + 3z^{-1} + 3z^{-2} + z^{-3}\right) \tag{5.76}$$

$$H(z) = \frac{1}{4}\left(-1 + 3z^{-1} + 3z^{-2} - z^{-3}\right). \tag{5.77}$$

The transfer functions for the two-channel length-4 linear-phase lattice filters
are:







Table 5.6. Effort to compute two-channel filter banks if both filters are of length L.

Type                             Number of real   Number of real   See    Speed   Can use
                                 multipliers      adders           Fig.           RAG?
------------------------------------------------------------------------------------------
Polyphase with any coefficients
  Direct FIR filtering           2L               2L - 2           5.41   2f_s    ✓
  Lifting                        ≈ L              ≈ L              5.43   2f_s    –
Quadrature mirror filter (QMF)
  Identical polyphase filter     L                L                5.44   2f_s    ✓
Orthogonal filter
  Transposed FIR filter          L                2L - 2           5.45   f_s     ✓
  Lattice                        L + 1            3L/4             5.46   2f_s    –
Linear-phase filter
  Symmetric filter               L                2L - 2           3.5    2f_s    ✓
  Lattice                        L/2              3L/2 - 1         5.48   2f_s    –



$$G(z) = \left((1 + a[0]) + a[0]z^{-1} + a[0]z^{-2} + (1 + a[0])z^{-3}\right)s \tag{5.78}$$

$$H(z) = \left(-(1 + a[0]) + a[0]z^{-1} + a[0]z^{-2} - (1 + a[0])z^{-3}\right)s. \tag{5.79}$$

Comparing (5.76) with (5.78), we find

$$s = -1/2 \qquad a[0] = -1.5. \tag{5.80}$$



Note that, compared with the direct implementation, only about one quarter 
of the multipliers are required. 

The disadvantage of the linear-phase lattice is that not all linear-phase 
filters can be implemented. Specifically, G(z) must be even symmetric, H(z ) 
must be odd symmetric, and both filters must be of the same length, with 
an even number of samples. 

Comparison of implementation options. Finally, Table 5.6 compares 
the different implementation options, which include the general case and 
special types like QMF, linear-phase and orthogonal. 

Table 5.6 shows the required number of multipliers and adders, the reference
figure, the maximum input rate, and the structurally important question
of whether the coefficients can be implemented using the reduced adder graph
technique, or occur as single-multiplier coefficients. For shorter filters, the
lattice structure seems to be attractive, while for longer filters, RAG will most
often produce smaller and faster designs. Note that the numbers of multipliers
and adders in Table 5.6 are estimates of the hardware effort required for
the filter, and not the typical numbers found in the literature for the
computational effort per input sample in a PDSP/μP solution [86, 101].

Excellent additional literature about two-channel filter banks is available
(see [84, 98, 101, 102]).



5.7 Wavelets 

A time-frequency representation of signals processed through transform me- 
thods has proven beneficial for audio and image processing [98, 103, 104]. 
Many signals subject to analysis are known to have statistically constant 
properties for only short time frames (e.g., speech or audio signals). It is 
therefore reasonable to analyze such signals in a short window, compute the 
signal parameter, and slide the window forward to analyze the next frame. If
this analysis is based on Fourier transforms, it is called a short-term Fourier
transform (STFT).

A short-term Fourier transform (STFT) is formally defined by

$$X(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j2\pi f t}\, dt, \tag{5.81}$$

i.e., it slides a window function $w(t - \tau)$ over the signal x(t), and produces
a continuous time-frequency map. The window should taper smoothly to
zero, both in frequency and time, to ensure localization in frequency $\Delta f$
and time $\Delta t$ of the mapping. One weight function, the Gaussian function
($g(t) = e^{-t^2}$), is optimal in this sense, and provides the minimum (Heisenberg
principle) product $\Delta f \Delta t$ (i.e., best localization), as proposed by Gabor in 1949
[105]. The discretization of the Gabor transform leads to the discrete Gabor
transform (DGT). The Gabor transform uses identical resolution windows
throughout the time and frequency plane (see Fig. 5.50a). Every rectangle
in Fig. 5.50a has exactly the same shape, but often a constant Q (i.e., the
quotient of bandwidth to center frequency) is desirable, especially in audio
and image processing. That is, for high frequencies we wish to have broadband
filters and short sampling intervals, while for low frequencies, the bandwidth
should be small and the intervals larger. This can be accomplished with the
continuous wavelet transform (CWT), introduced by Grossmann and Morlet
[106],
Fig. 5.49. Frequency distribution for (a) Fourier (constant bandwidth) and (b)
constant Q. 

Fig. 5.50. Time frequency grids for a chirp signal, (a) Short-term Fourier trans- 
form. (b) Wavelet transform. 



$$\mathrm{CWT}(\tau, f) = \int_{-\infty}^{\infty} x(t)\, h(f(t - \tau))\, dt, \tag{5.82}$$

where h(t), known from the Huygens principle in physics, is called a small
wave or wavelet. Some typical wavelets are displayed in Fig. 5.51.

If we use now as a wavelet

$$h(t) = \left(e^{j2\pi kt} - e^{-k^2/2}\right) e^{-t^2/2} \tag{5.83}$$

we still enjoy the “optimal” properties of the Gaussian window, but now with 
different scales in time and frequency. This so-called Morlet transform is also 
subject to quantization, and is then called the discrete Morlet transformation 
(DMT) [107]. In the discrete case the lattice points in time and frequency are 
shown in Fig. 5.50b. The exponential term $e^{-k^2/2}$ in (5.83) was introduced
such that the wavelet is DC free. The following examples show the excellent
performance of the Gaussian window.

Fig. 5.51. Some typical wavelets from Morlet, Meyer, and Daubechies. (a) Morlet
wavelet. (b) Meyer wavelet. (c) Daubechies wavelet.

Fig. 5.52. Analysis of a chirp signal with (a) discrete Morlet transform, (b) Haar
transform.

Example 5.20: Analysis of a Chirp Signal 

Figure 5.52 shows the analysis of a constant amplitude signal with increasing 
frequency. Such signals are called chirp signals. If we applied the Fourier 
transform we would get a uniform spectrum, because all frequencies are 
present. The Fourier spectrum does not preserve time-related information. 
If we use instead an STFT with a Gaussian window, i.e., the Morlet trans- 
form, as shown in Fig. 5.52a, we can clearly see the increasing frequency. 
But the Gaussian window shows the best localization of all windows. On the 
other hand, with a Haar window we would have less computational effort, 
but, as can be seen from Fig. 5.52b, the Haar window will achieve less precise 
time-frequency localization of the signal. | 5.20 | 



Both DGT and DMT provide good localization by using a Gaussian window,
but both are computationally intensive. An efficient multiplier-free
implementation is based on two ideas. First, the Gaussian window can be
sufficiently approximated by a convolution of (≥ 3) rectangular functions, and
second, single-passband frequency-sampling filters (FSF) can be efficiently
implemented by defining algebraic integers over polynomial rings, as
introduced in [107].

Fig. 5.53. Wavelets tree decomposition in three octaves (©1999 Springer Press [5]).

In the following, we wish to focus our attention on a newly popular analysis
method called the discrete wavelet transform, which better exploits the
auditory and visual human perception mode (i.e., constant Q), and also can
often be more efficiently computed, using O(n) complexity algorithms.



5.7.1 The Discrete Wavelet Transformation 

A discrete-time version of the analog model leads to the discrete wavelet
transform (DWT). In practical applications, the DWT is restricted to the
discrete time dyadic DWT with a = 2, and will be considered in the following.
The DWT achieves the constant Q bandwidth distribution shown in
Fig. 5.49b and Fig. 5.50b by always applying the two-channel filter bank in
a filter tree to the lowpass signal, as shown in Fig. 5.53.

We now wish to focus on what conditions for the CWT wavelet allow it 
to be realized with a two-channel DWT filter bank. We may argue that if 
we sample a continuous wavelet at an appropriate rate (above the Nyquist 
rate), we may call the sampled version a DWT. But, in general, only those 
continuous wavelet transforms that can be realized with a two-channel filter 
bank are called DWT. 

Closely related to whether a continuous wavelet $\psi(t)$ can be realized with a two-channel DWT is the question of whether the scaling equation

$$\phi(t) = \sum_n g[n]\,\phi(2t - n) \qquad (5.84)$$

exists, where the actual wavelet is computed with

$$\psi(t) = \sum_n h[n]\,\phi(2t - n), \qquad (5.85)$$

where g[n] is a lowpass, and h[n] a highpass filter. Note that $\phi(t)$ and $\psi(t)$ are continuous functions, while g[n] and h[n] are sample sequences (but still may also be IIR filters). Note that (5.84) is similar to the self-similarity ($\phi(t) = \phi(at)$) exhibited by fractals. In fact, the scaling equation may iterate to a fractal, but that is, in general, not the desired case, because most often a smooth wavelet is desired. The smoothness can be improved if we use a filter with the maximal number of zeros at $\pi$.

We consider now backwards reconstruction: we start with the filter g[n], 
and construct the corresponding wavelet. This is the most common case, 
especially if we use the half-band design from Algorithm 5.14 (p. 219) to 
generate perfect reconstruction filter pairs of the desired length and property. 

To get a graphical interpretation of the wavelet, we start with a rectan- 
gular function (box function) and build, according to (5.84), the following 
graphical iteration: 

$$\phi^{(k+1)}(t) = \sum_n g[n]\,\phi^{(k)}(2t - n). \qquad (5.86)$$

If this converges to a stable function $\phi(t)$, the (new) wavelet is found. This iteration obviously converges for the Haar filter {1, 1} immediately after the first iteration, because the sum of two box functions scaled and added is again a box function, i.e.,

$$r(t) = r(2t) + r(2t - 1).$$

Let us now graphically construct the wavelet that belongs to the filter 
g[n] = {1, 1, 1, 1}, which we will call Hutlet4 [108]. 

Fig. 5.54. Iteration steps 1, 2, 3, and 10 (panels a to d) for Hutlet4. (Solid line: $\phi^{(k+1)}(t)$; dotted line: $\phi^{(k)}(2t - n)$; ideal Hut-function: dashed.)



Example 5.21: Hutlet of Length-4

We start with four box functions weighted by g[n] = {1, 1, 1, 1}. The sum shown in Fig. 5.54a is the starting $\phi^{(1)}(t)$. This function is scaled by two, and the sum gives a two-step function. After 10 iterations we already get a very smooth trapezoid function. If we now use the QMF relation, from (5.54) (p. 215), to construct the actual wavelet, we get the Hutlet4, which has two triangles as shown in Fig. 5.55. | 5.21 |



We note that g[n] is the impulse response of the moving-average filter, and can be implemented as a one-stage CIC filter [109]. Figure 5.55 shows all scaling functions and wavelets for this type of wavelet with even length coefficients.

As noted before, the iteration defined by (5.86) may also converge to a fractal. Such an example is shown in Fig. 5.56, which is the wavelet for the length-5 “moving average filter.” This indicates the challenge of the filter selection g[n]: it may converge to a smooth or to a totally chaotic function, depending only on an apparently insignificant property like the length of the filter!

Fig. 5.55. The Hutlet wavelet family (solid line) and scaling function (dashed line) after 10 iterations, shown over time t for Hutlet2, Hutlet4, Hutlet6, Hutlet8, Hutlet10, and Hutlet12 (©1999 Springer Press [5]).

We still have not explained why the two-scale equation (5.84) is so important for the DWT. This can be better understood if we rearrange the downsampler (compressor) and filter in the analysis part of the DWT, using the “Noble” relation

$$(\downarrow M)\,H(z) = H(z^M)\,(\downarrow M), \qquad (5.87)$$

which was introduced in Sect. 5.1.1, p. 176. The results for a three-level filter bank are shown in Fig. 5.57. If we compute the impulse response of the cascade sequences, i.e.,

$$\begin{aligned}
H(z) &\leftrightarrow d_1[k/2] \\
G(z)H(z^2) &\leftrightarrow d_2[k/4] \\
G(z)G(z^2)H(z^4) &\leftrightarrow d_3[k/8] \\
G(z)G(z^2)G(z^4) &\leftrightarrow a_3[k/8],
\end{aligned}$$

we find that $a_3$ is an approximation to the scaling function, while $d_3$ gives an approximation to the mother wavelet, if we compare the graphs with the continuous wavelet shown in Fig. 5.51 (p. 231).
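
The cascade products above can be checked numerically. The following is a minimal Python sketch, not taken from the book: the helper names `upsample`, `convolve`, and `cascade` are hypothetical, and the lowpass filter is the Hutlet4 choice g[n] = {1, 1, 1, 1} without any normalization. It computes the impulse responses of $G(z)G(z^2)$ and $G(z)G(z^2)G(z^4)$ and also verifies the Noble relation (5.87) for M = 2 on a short test sequence.

```python
def upsample(h, m):
    """Insert m-1 zeros between taps, i.e., H(z) -> H(z^m)."""
    out = [0.0] * ((len(h) - 1) * m + 1)
    out[::m] = h
    return out

def convolve(a, b):
    """Plain linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def cascade(*filters):
    """Impulse response of a product of transfer functions."""
    out = [1.0]
    for f in filters:
        out = convolve(out, f)
    return out

g = [1.0, 1.0, 1.0, 1.0]                         # Hutlet4 lowpass g[n]
a2 = cascade(g, upsample(g, 2))                  # G(z) G(z^2)
a3 = cascade(g, upsample(g, 2), upsample(g, 4))  # G(z) G(z^2) G(z^4)

# Noble relation: filtering with H(z^M) and then downsampling by M
# equals downsampling by M and then filtering with H(z).
M = 2
x = [float(v) for v in range(1, 13)]
filter_then_down = convolve(x, upsample(g, M))[::M]
down_then_filter = convolve(x[::M], g)

print(a2)   # a two-step staircase that starts to resemble the trapezoid
```

With each additional stage the staircase becomes finer, which is the discrete counterpart of the graphical iteration (5.86).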

This is not always possible. For instance, for the Morlet wavelet shown in 
Fig. 5.51 (p. 231), no scaling function can be found, and a realization using 
the DWT is not possible. 

Fig. 5.56. Iteration steps 1, 2, 4, and 8 (panels a to d) for Hutlet5. The sequence converges to a fractal!



Two-channel DWT design examples for the Daubechies length-4 filter 
have already been discussed, in combination with polyphase representation 
(Example 5.1, p. 180), and the lattice implementation of orthogonal filters in 
Example 5.18 (p. 225). 



Exercises 

5.1: Let $F(z) = 1 + z^{-d}$. For which d do we have a half-band filter according to Definition 5.7 (p. 204)?

5.2: Let $F(z) = 1 + z^{-5}$ be a half-band filter.

(a) Draw $|F(\omega)|$. What kind of symmetry does this filter have?

(b) Use Algorithm 5.14 (p. 219) to compute a perfectly reconstructing real filter 
bank. What is the total delay of the filter bank? 

5.3: Use the half-band filter F3 from Example 5.15 (p. 219) to build a perfect-reconstruction filter bank, using Algorithm 5.14 (p. 219), of length

(a) 1/7. 

(b) 2/6. 

5.4: How many different filter pairs can be built, using F3 from Example 5.15 (p. 219), if both filters are

(a) Complex.

(b) Real.

(c) Linear-phase.

(d) Orthogonal filter bank.

Fig. 5.57. DWT filter bank rearranged using Noble relations. (a) Transfer function in the z-domain. (b) Impulse response for the length-4 Daubechies filters.

5.5: Use the half-band filter $F_2(z) = 1 + 2z^{-1} + z^{-2}$ to compute, based on Algorithm 5.14 (p. 219), all possible perfect-reconstruction filter banks.

5.6: (a) Compute the number of real additions and multiplications for a direct im- 
plementation of the critically sampled uniform DFT filter bank shown in Fig. 5.33 
(p. 211). Assume the length L analysis and synthesis filters have complex coeffi- 
cients, and the inputs are real valued. 

(b) Assume an FFT algorithm is used that needs $(15N\log_2(N))$ real additions and multiplications. Compute the total effort for a uniform DFT filter bank, using the polyphase representation from Figs. 5.35 (p. 212) and 5.36 (p. 213), for R complex filters of length L.

(c) Using the results from (a) and (b) compute the effort for a critically sampled 
DFT filter bank with L = 64 and R = 16. 

5.7: Use the lossy integrator from Example 5.9 (p. 213) to implement an R = 4 
uniform DFT filter bank. 

(a) Compute the analysis polyphase filter $H_k(z)$.

(b) Determine the synthesis filter $F_k(z)$ for perfect reconstruction.

(c) Determine the 4 × 4 DFT matrix. How many real additions and multiplications are used to compute the DFT?

(d) Compute the total computational effort of the whole filter bank, in terms of 
real additions and multiplications per input sample. 

5.8: Analyze the frequency response of each Goodman and Carey half-band filter 
from Table 5.3 (p. 204). Zoom in on the passband to estimate the ripple of the 
filter. 

5.9: Prove the perfect reconstruction for the lifting and dual-lifting scheme from (5.65) and (5.66) on p. 223.



Exercises Using MaxPlusII 

5.10: (a) Implement the Daubechies length-4 filter using the lifting scheme from 
Example 5.17 (p. 223), with 8-bit input and coefficient, and 10-bit output quanti- 
zation. 

(b) Simulate the design with two impulses of amplitude 100, similar to Fig. 5.47 (p. 227).

(c) Determine LC utilization and the Registered Performance. 

(d) Compare the lifting design with the direct polyphase implementation (Example 
5.1, p. 180) and with the lattice implementation (Example 5.18, p. 225), in terms 
of size and speed. 

5.11: Use component instantiation of the two designs from Example 5.4 (p. 191) and 
Example 5.6 (p. 199) to compute the difference of the two filter outputs. Determine 
the maximum positive and negative deviation. 

5.12: (a) Use the reduced adder graph design from Fig. 3.11 (p. 128) to build a half-band filter F6 (see Table 5.3, p. 204) for 8-bit inputs using MaxPlusII. Use the transposed FIR structure (Fig. 3.3, p. 111) as the filter architecture.

(b) Verify the function via a simulation of the impulse response. 

(c) Determine size in LCs, and Registered Performance, of the F6 design. 

5.13: (a) Compute the polyphase representation for F6 from Table 5.3, p. 204. 

(b) Implement the polyphase filter F6 with decimation R = 2 for 8-bit inputs with 
MaxPlusII. 

(c) Verify the function via a simulation of the impulse (one at even and one at odd) 
response. 

(d) Determine size in LCs, and Registered Performance, of the polyphase design. 

(e) What are the advantages and disadvantages of the polyphase design, when compared with the direct implementation from Exercise 5.12 (p. 239), in terms of size and speed?

5.14: (a) Compute the 8-bit quantized DB4 filters G(z) by multiplication of (5.71) with 256 and taking the integer part. Use the program csd.exe from the CD-ROM or the data from Table 2.3, p. 40.

(b1) Design the filter G(z) only from Fig. 5.45, p. 225 for 9-bit inputs with MaxPlusII. Assume that input and coefficient are signed, i.e., only one additional guard bit is required for a filter of length 4.

(b2) Determine size in LCs, and Registered Performance, of the filter G(z).

(b3) What are the advantages and disadvantages of the CSD design, when compared with the programmable FIR filter from Example 3.1 (p. 111), in terms of size and speed?

(c1) Design the filter bank with H(z) and G(z) from Fig. 5.45, p. 225.

(c2) Determine size in LCs, and Registered Performance, of the filter bank.

(c3) What are the advantages and disadvantages of the CSD filter bank design, when compared with the lattice design from Example 5.18 (p. 225), in terms of size and speed?

The discrete Fourier transform (DFT) and its fast implementation, the fast Fourier transform (FFT), have played a central role in digital signal processing.

DFT and FFT algorithms have been invented (and reinvented) in many 
variations. As Heideman et al. [110] pointed out, we know that Gauss used 
an FFT-type algorithm that today we call the Cooley-Tukey FFT. In this 
chapter we will discuss the most important algorithms summarized in Fig. 6.1. 

We will follow the terminology introduced by Burrus [111], who classified FFT algorithms simply by the (multidimensional) index maps of their input and output sequences. We will therefore call all algorithms that do not use a multidimensional index map DFT algorithms, although some of them, such as the Winograd DFT algorithms, enjoy a substantially reduced computational effort. DFT and FFT algorithms do not “stand alone”: the most efficient implementations often result in a combination of DFT and FFT algorithms. For instance, the combination of the Rader prime algorithm and the Good-Thomas FFT results in excellent VLSI implementations. The literature provides many FFT design examples. We find implementations with PDSPs and ASICs [112, 113, 114, 115, 116, 117]. FFTs have also been developed using FPGAs for 1-D [118, 119, 120] and 2-D transforms [43, 121].

Fig. 6.1. Classifications of DFT and FFT algorithms. Without multidimensional index map: Goertzel algorithm, Bluestein chirp-z transform, Rader algorithm, Winograd DFT, Hartley transform. With multidimensional index map: Cooley-Tukey FFT (decimation in frequency (DIF), decimation in time (DIT)), Good-Thomas FFT, Winograd FFT algorithms.

We will discuss in this chapter the four most important DFT algorithms 
and the three most often used FFT algorithms, in terms of computational 
effort, and will compare the different implementation issues. At the end of the 
chapter, we will discuss Fourier-related transforms, such as the DCT, which 
is an important tool in image compression (e.g., JPEG, MPEG). We start 
with a short review of definitions and the most important properties of the 
DFT. 

For more detailed study, students should be aware that DFT algorithms 
are covered in basic DSP books [5, 67, 122, 123], and a wide variety of FFT 
books are also available [56, 124, 125, 126, 127, 128]. 



6.1 The Discrete Fourier Transform Algorithms 

We will start with a review of the most important DFT properties and will 
then review basic DFT algorithms introduced by Bluestein, Goertzel, Rader, 
and Winograd. 

6.1.1 Fourier Transform Approximations Using the DFT 

The Fourier transform pair is defined by 

$$X(f) = \int_{-\infty}^{\infty} x(t)\,e^{-j2\pi ft}\,dt \;\longleftrightarrow\; x(t) = \int_{-\infty}^{\infty} X(f)\,e^{j2\pi ft}\,df. \qquad (6.1)$$

The formulation assumes a continuous signal of infinite duration and 
bandwidth. For practical representation, we must sample in time and fre- 
quency, and amplitudes must be quantized. From an implementation stand- 
point, we prefer to use a finite number of samples in time and frequency. This 
leads to the discrete Fourier transform (DFT), where N samples are used 
in time and frequency, according to 



$$X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-j2\pi kn/N} = \sum_{n=0}^{N-1} x[n]\,W_N^{kn}, \qquad (6.2)$$

and the inverse DFT (IDFT) is defined as

$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi kn/N} = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,W_N^{-kn}, \qquad (6.3)$$

or, in vector/matrix notation

$$\mathbf{X} = \mathbf{W}\mathbf{x} \;\leftrightarrow\; \mathbf{x} = \frac{1}{N}\mathbf{W}^{*}\mathbf{X}. \qquad (6.4)$$

Fig. 6.2. Window functions in time and frequency.
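
As a small numerical illustration of the vector/matrix form, the following Python sketch builds the DFT matrix with entries $W_N^{kn} = e^{-j2\pi kn/N}$ and recovers the input with the conjugated matrix scaled by 1/N. The helper names `dft_matrix` and `matvec`, the length N = 8, and the ramp test vector are assumptions chosen only for this example.

```python
import cmath

def dft_matrix(N):
    """N x N matrix with entries W_N^(k*n) = exp(-j*2*pi*k*n/N)."""
    return [[cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)]
            for k in range(N)]

def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

N = 8
W = dft_matrix(N)
x = [float(n) for n in range(N)]           # test sequence 0, 1, ..., 7

X = matvec(W, x)                           # forward DFT: X = W x
W_conj = [[w.conjugate() for w in row] for row in W]
x_back = [v / N for v in matvec(W_conj, X)]   # inverse: x = (1/N) W* X

print(abs(X[0]))                           # DC term, the sum of all samples
```

The direct matrix product needs on the order of $N^2$ complex multiplications, which is the baseline against which the algorithms of this chapter are compared.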

If we use the DFT to approximate the Fourier spectrum, we must remember 
the effect of sampling in time and frequency, namely: 

• By sampling in time, we get a periodic spectrum with the sampling frequency $f_s$. The approximation of a Fourier transform by a DFT is reasonable only if the frequency components of $x(t)$ are concentrated on a smaller range than the Nyquist frequency $f_s/2$, as stated in the “Shannon sampling theorem.”

• By sampling in the frequency domain, the time function becomes periodic, i.e., the DFT assumes the time series to be periodic. If an N-sample DFT is applied to a signal that does not complete an integer number of cycles within an N-sample window, a phenomenon called leakage occurs. Therefore, if possible, we should choose the sampling frequency and the analysis window in such a way that it covers an integer number of periods of $x(t)$, if $x(t)$ is periodic.

A more practical alternative for decreasing leakage is the use of a window 
function that tapers smoothly to zero on both sides. Such window functions 
were already discussed in the context of FIR filter design in Chap. 3 (see 
Table 3.2, p. 119). Figure 6.2 shows the time and frequency behavior of some 
typical windows [89, 129]. 

An example illustrates the use of a window function. 

Example 6.1: Windowing 

Figure 6.3a shows a sinusoidal signal that does not complete an integer number of periods in its sample window. The Fourier transform of the signal should ideally include only the two Dirac functions at $\pm\omega_0$, as displayed in Fig. 6.3b. Figures 6.3c and d show the DFT analysis with different windows. We note that the analysis with the box function has somewhat more ripple than the analysis with the Hanning window. An exact analysis would also show that the main lobe width with Hanning analysis is larger than the width achieved with the box function, i.e., no window. | 6.1 |

Fig. 6.3. Analysis of periodic function through the DFT, using window functions.
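
The leakage effect can be reproduced with a few lines of Python. This is only an illustrative sketch: the length N = 64, the test frequency of 10.5 cycles per window, the periodic form of the Hanning window, and the bin range used to measure leakage are all assumptions made for this example and are not taken from Fig. 6.3.

```python
import cmath
import math

def dft_mag(x):
    """Magnitude of the direct DFT of a real or complex sequence."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) for k in range(N)]

N = 64
# 10.5 cycles per window: not an integer number of periods, so leakage occurs
x = [math.sin(2 * math.pi * 10.5 * n / N) for n in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

box_mag = dft_mag(x)                                       # box window
hann_mag = dft_mag([xi * wi for xi, wi in zip(x, hann)])   # Hanning window

# Largest spectral line well away from the main lobe (bins 20 to 32)
leak_box = max(box_mag[20:33])
leak_hann = max(hann_mag[20:33])
print(leak_box, leak_hann)
```

Far from the signal frequency the Hanning window leaves much less energy than the box function, at the price of the wider main lobe mentioned in the example.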



6.1.2 Properties of the DFT 

The most important properties of the DFT are summarized in Table 6.1. 
Many properties are identical with the Fourier transform, e.g., the transform 
is unique (bijective), the superposition applies, and real and imaginary parts 
are related through the Hilbert transform. 

The similarity of the forward and inverse transform leads to an alternative 
inversion algorithm. Using the vector/matrix notation (6.4) of the DFT 

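$$\mathbf{X} = \mathbf{W}\mathbf{x} \;\leftrightarrow\; \mathbf{x} = \frac{1}{N}\mathbf{W}^{*}\mathbf{X}, \qquad (6.5)$$

we can conclude

$$\mathbf{x}^{*} = \frac{1}{N}\left(\mathbf{W}^{*}\mathbf{X}\right)^{*} = \frac{1}{N}\mathbf{W}\mathbf{X}^{*}, \qquad (6.6)$$

i.e., we can use the DFT of $\mathbf{X}^{*}$ scaled by 1/N to compute the inverse DFT (the result is $\mathbf{x}^{*}$, which equals $\mathbf{x}$ for a real sequence).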
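
A minimal Python sketch of this inversion trick follows. The function names `dft` and `idft_via_dft` and the complex test vector are hypothetical; the direct $O(N^2)$ DFT merely stands in for whatever forward transform is available.

```python
import cmath

def dft(x):
    """Direct forward DFT, X[k] = sum_n x[n] W_N^(kn)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft_via_dft(X):
    """Inverse DFT using only the forward DFT: x = conj(DFT(conj(X))) / N."""
    N = len(X)
    Y = dft([v.conjugate() for v in X])
    return [y.conjugate() / N for y in Y]

x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4 + 0j, -2 - 2j]
x_back = idft_via_dft(dft(x))
print(max(abs(a - b) for a, b in zip(x, x_back)))   # close to zero
```

In hardware this means that one forward transform core can serve both directions if the conjugations (sign changes of the imaginary parts) are added at input and output.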

Table 6.1. Theorems of the DFT

| Theorem | x[n] | X[k] |
|---|---|---|
| Transform | $x[n]$ | $\sum_{n=0}^{N-1} x[n]\,e^{-j2\pi nk/N}$ |
| Inverse transform | $\frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi nk/N}$ | $X[k]$ |
| Superposition | $s_1 x_1[n] + s_2 x_2[n]$ | $s_1 X_1[k] + s_2 X_2[k]$ |
| Time reversal | $x[-n]$ | $X[-k]$ |
| Conjugate complex | $x^*[n]$ | $X^*[-k]$ |
| Split: real part | $\Re(x[n])$ | $(X[k] + X^*[-k])/2$ |
| Split: imaginary part | $\Im(x[n])$ | $(X[k] - X^*[-k])/(2j)$ |
| Split: real even part | $x_e[n] = (x[n] + x[-n])/2$ | $\Re(X[k])$ |
| Split: real odd part | $x_o[n] = (x[n] - x[-n])/2$ | $j\Im(X[k])$ |
| Symmetry | $X[n]$ | $N x[-k]$ |
| Cyclic convolution | $x[n] \circledast f[n]$ | $X[k]F[k]$ |
| Multiplication | $x[n] \times f[n]$ | $\frac{1}{N} X[k] \circledast F[k]$ |
| Periodic shift | $x[n - d \bmod N]$ | $X[k]\,e^{-j2\pi dk/N}$ |
| Parseval theorem | $\sum_{n=0}^{N-1} \lvert x[n]\rvert^2$ | $\frac{1}{N}\sum_{k=0}^{N-1} \lvert X[k]\rvert^2$ |


DFT of a Real Sequence 

We now turn to some additional computational savings for DFT (and FFT) computations, when the input sequence is real. In this case, we have two options: we can compute with one N-point DFT the DFT of two N-point sequences, or we can compute with an N-point DFT a length 2N DFT of a real sequence.

If we use the Hilbert property from Table 6.1, i.e., a real sequence has an even-symmetric real spectrum and an odd imaginary spectrum, the following algorithms can be synthesized [124].

Algorithm 6.2: Length 2N Transform with N-point DFT

The algorithm to compute the 2N-point DFT $X[k] = X_r[k] + jX_i[k]$ from the time sequence x[n] is as follows:

1) Build an N-point sequence $y[n] = x[2n] + jx[2n + 1]$ with $n = 0, 1, \ldots, N - 1$.

2) Compute $y[n] \;\circ\!\!-\!\!\bullet\; Y[k] = Y_r[k] + jY_i[k]$, where $\Re(Y[k]) = Y_r[k]$ is the real and $\Im(Y[k]) = Y_i[k]$ is the imaginary part of $Y[k]$, respectively.

3) Compute

$$X_r[k] = \frac{Y_r[k] + Y_r[-k]}{2} + \cos(\pi k/N)\,\frac{Y_i[k] + Y_i[-k]}{2} - \sin(\pi k/N)\,\frac{Y_r[k] - Y_r[-k]}{2}$$

$$X_i[k] = \frac{Y_i[k] - Y_i[-k]}{2} - \sin(\pi k/N)\,\frac{Y_i[k] + Y_i[-k]}{2} - \cos(\pi k/N)\,\frac{Y_r[k] - Y_r[-k]}{2}$$

with $k = 0, 1, \ldots, N - 1$.

The computational effort, therefore, besides an N-point DFT (or FFT), is 4N real additions and multiplications, from the twiddle factors $\pm\exp(j\pi k/N)$.
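
The three steps of Algorithm 6.2 can be sketched in Python as follows. The function name `real_dft_2n` and the eight-sample test sequence are assumptions for this illustration, and the direct `dft` helper stands in for an N-point FFT. Indices $-k$ are taken modulo N.

```python
import cmath
import math

def dft(x):
    """Direct N-point DFT (placeholder for an FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def real_dft_2n(x):
    """First N outputs of the 2N-point DFT of a real x, via one N-point DFT."""
    N = len(x) // 2
    # Step 1: pack even and odd samples into one complex sequence
    y = [complex(x[2 * n], x[2 * n + 1]) for n in range(N)]
    # Step 2: one N-point transform
    Y = dft(y)
    # Step 3: separate the two spectra and apply the twiddle factor
    X = []
    for k in range(N):
        Yk, Ym = Y[k], Y[(-k) % N]
        c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
        Xr = ((Yk.real + Ym.real) / 2 + c * (Yk.imag + Ym.imag) / 2
              - s * (Yk.real - Ym.real) / 2)
        Xi = ((Yk.imag - Ym.imag) / 2 - s * (Yk.imag + Ym.imag) / 2
              - c * (Yk.real - Ym.real) / 2)
        X.append(complex(Xr, Xi))
    return X

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # real sequence, 2N = 8
X_half = real_dft_2n(x)
X_ref = dft(x)[:4]
print(max(abs(a - b) for a, b in zip(X_half, X_ref)))
```

Only the outputs for $k = 0, \ldots, N-1$ are produced; the remaining values of a real input follow from the conjugate symmetry of its spectrum.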

To transform two length N sequences with a length N DFT, we use the fact (see Table 6.1) that a real sequence has a conjugate-even spectrum, while the spectrum of a purely imaginary sequence is conjugate-odd. This is the basis for the following algorithm.

Algorithm 6.3: Two Length N Transforms with one N-point DFT

The algorithm to compute the N-point DFTs $g[n] \;\circ\!\!-\!\!\bullet\; G[k]$ and $h[n] \;\circ\!\!-\!\!\bullet\; H[k]$ is as follows:

1) Build an N-point sequence $y[n] = h[n] + jg[n]$ with $n = 0, 1, \ldots, N - 1$.

2) Compute $y[n] \;\circ\!\!-\!\!\bullet\; Y[k] = Y_r[k] + jY_i[k]$, where $\Re(Y[k]) = Y_r[k]$ is the real and $\Im(Y[k]) = Y_i[k]$ is the imaginary part of $Y[k]$, respectively.

3) Compute, finally

$$H[k] = \frac{Y_r[k] + Y_r[-k]}{2} + j\,\frac{Y_i[k] - Y_i[-k]}{2}$$

$$G[k] = \frac{Y_i[k] + Y_i[-k]}{2} - j\,\frac{Y_r[k] - Y_r[-k]}{2},$$

with $k = 0, 1, \ldots, N - 1$.

The computational effort, therefore, besides an N-point DFT (or FFT), is 2N real additions, to form the correct two N-point DFTs.
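
Algorithm 6.3 translates almost line by line into Python. In this sketch the function name `two_real_dfts` and the two test sequences are hypothetical, and the direct `dft` again stands in for an FFT.

```python
import cmath

def dft(x):
    """Direct N-point DFT (placeholder for an FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def two_real_dfts(h, g):
    """DFTs H[k] and G[k] of two real sequences from one complex DFT."""
    N = len(h)
    # Step 1 and 2: transform y[n] = h[n] + j g[n]
    Y = dft([complex(hn, gn) for hn, gn in zip(h, g)])
    H, G = [], []
    # Step 3: split into conjugate-even and conjugate-odd parts
    for k in range(N):
        Yk, Ym = Y[k], Y[(-k) % N]
        H.append(complex((Yk.real + Ym.real) / 2, (Yk.imag - Ym.imag) / 2))
        G.append(complex((Yk.imag + Ym.imag) / 2, -(Yk.real - Ym.real) / 2))
    return H, G

h = [1.0, 2.0, 0.0, -1.0, 3.0]
g = [0.5, -2.0, 4.0, 1.0, 0.0]
H, G = two_real_dfts(h, g)
print(max(abs(a - b) for a, b in zip(H, dft(h))))
```

The split itself needs only additions and sign changes, in line with the effort stated for the algorithm.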



Fast Convolution Using DFT 

One of the most frequent applications of the DFT (or FFT) is the computation of convolutions. As with the Fourier transform, the convolution in time is done by multiplying the two transformed sequences: the two time sequences are transformed into the frequency domain, we compute a (scalar) pointwise product, and we transform the product back into the time domain. The main difference, compared with the Fourier transform, is that now the DFT computes a cyclic, and not a linear, convolution. This must be considered when implementing fast convolution with the FFT. This leads to two methods called “overlap save” and “overlap add.” In the overlap save method, we basically discard the samples at the border that are corrupted by the cyclic convolution. In the overlap add method, we zero-pad the filter and signal in such a way that we can directly add the partial sequences to a common product stream.
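
The difference between cyclic and linear convolution, and the zero-padding remedy used by the overlap add method, can be seen in a small Python sketch. The helper names and the short test sequences are assumptions for this example; a direct DFT is used in place of an FFT.

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def cyclic_conv(x, f):
    """Cyclic convolution via pointwise product of the two DFTs."""
    N = len(x)
    product = [a * b for a, b in zip(dft(x), dft(f))]
    return [v.real / N for v in dft(product, sign=+1)]

def linear_conv_via_dft(x, f):
    """Linear convolution: zero-pad both sequences to len(x)+len(f)-1."""
    L = len(x) + len(f) - 1
    xp = list(x) + [0.0] * (L - len(x))
    fp = list(f) + [0.0] * (L - len(f))
    return cyclic_conv(xp, fp)

x = [1.0, 2.0, 3.0, 4.0]
f = [1.0, 1.0]

cyc = [round(v, 9) for v in cyclic_conv(x, f + [0.0, 0.0])]
lin = [round(v, 9) for v in linear_conv_via_dft(x, f)]
print(cyc)   # first sample is corrupted by wrap-around
print(lin)   # zero-padding removes the wrap-around
```

In the length-4 cyclic result the first output contains the wrapped-around contribution of the last input sample, which is exactly the border sample an overlap save scheme would discard.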

Most often the input sequences for the fast convolution are real. An efficient convolution may therefore be accomplished with a real transform, such as the Hartley transform discussed in Exercise 6.15, p. 285. We may also construct an FFT-like algorithm for the Hartley transform, and can get about twice the performance compared with a complex transform [130].

If we wish to utilize an available FFT program, we may use one of the previously discussed Algorithms 6.2 or 6.3 for real sequences. An alternative approach is shown in Fig. 6.4. It shows a similar approach to Algorithm 6.3, where we implemented two N-point transforms with one N-point DFT, but in this case we use the “real” part for a DFT, and the imaginary part for the IDFT, which is needed for the back transformation, according to the convolution theorem.

Fig. 6.4. Real convolution using a complex FFT [56].

It is assumed that the DFT of the real-valued filter (i.e., $F[k] = F^*[-k]$) has been computed offline and, in addition, in the frequency domain we need only N/2 multiplications to compute X[k]F[k].

6.1.3 The Goertzel Algorithm 

A single spectral component X[k] in the DFT computation is given by

$$X[k] = x[0] + x[1]W_N^{k} + x[2]W_N^{2k} + \cdots + x[N-1]W_N^{(N-1)k}.$$

We can combine all x[n] with the same common factor $W_N^k$, and get

$$X[k] = x[0] + W_N^k\left(x[1] + W_N^k\left(x[2] + \cdots + W_N^k\,x[N-1]\right)\cdots\right).$$

It can be noted that this results in a possible recursive computation of X[k]. This is called the Goertzel algorithm, and is graphically interpreted by Fig. 6.5. The computation of y[n] starts with the last value of the input sequence x[N − 1]. After step three of the length-4 example, a spectrum value of X[k] is available at the output.
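
The nested form above is a Horner scheme and can be written as a short loop. In this Python sketch the function name `goertzel_first_order` and the test sequence are assumptions; the variable `reg` plays the role of the single register in the signal flow, and the input is applied in reversed order, starting with x[N − 1].

```python
import cmath

def goertzel_first_order(x, k):
    """One DFT value X[k] by the first-order Goertzel recursion."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi * k / N)   # W_N^k
    reg = 0.0                               # register content
    y = 0.0
    for sample in reversed(x):              # start with x[N-1]
        y = sample + reg                    # adder output y[n]
        reg = W * y                         # multiply by W_N^k and store
    return y

def dft_bin(x, k):
    """Reference value from the DFT definition."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

x = [1.0, 2.0, 3.0, 4.0]
X1 = goertzel_first_order(x, 1)
print(X1)   # approximately -2+2j for this length-4 input
```

Each output value needs N complex multiplications, so the method pays off only when a few spectral components are required.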



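| Step | x[n] | Register 1 | y[n] |
|---|---|---|---|
| 0 | $x[3]$ | $0$ | $x[3]$ |
| 1 | $x[2]$ | $W_4^{k}x[3]$ | $x[2] + W_4^{k}x[3]$ |
| 2 | $x[1]$ | $W_4^{k}x[2] + W_4^{2k}x[3]$ | $x[1] + W_4^{k}x[2] + W_4^{2k}x[3]$ |
| 3 | $x[0]$ | $W_4^{k}x[1] + W_4^{2k}x[2] + W_4^{3k}x[3]$ | $x[0] + W_4^{k}x[1] + W_4^{2k}x[2] + W_4^{3k}x[3]$ |

Fig. 6.5. The length-4 Goertzel algorithm (input x[n], adder output y[n], and one register in the feedback path with multiplication by $W_N^k$).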



If we have to compute several spectral components, we can reduce the complexity if we combine factors of the type $e^{\pm j2\pi n/N}$. This will result in second-order systems having a denominator according to

$$z^2 - 2z\cos\left(\frac{2\pi n}{N}\right) + 1.$$

All complex multiplications are then reduced to real multiplications.

In general, the Goertzel algorithm can be attractive if only a few spectral components have to be computed. For the whole DFT, the effort is of order $N^2$, and therefore yields no advantage compared with the direct DFT computation.
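
The second-order form can be sketched as follows. This is the commonly used textbook arrangement of the second-order Goertzel recursion and is not necessarily the exact structure the book has in mind: the input is applied in natural order, the loop uses only the real coefficient $2\cos(2\pi k/N)$, and a single complex operation is needed at the very end. The function names and the test data are assumptions.

```python
import cmath
import math

def goertzel_second_order(x, k):
    """X[k] with a real-coefficient second-order recursion."""
    N = len(x)
    theta = 2 * math.pi * k / N
    coeff = 2 * math.cos(theta)        # the only multiplier inside the loop
    s1 = s2 = 0.0                      # s[n-1] and s[n-2]
    for sample in x:                   # natural input order
        s0 = sample + coeff * s1 - s2  # denominator z^2 - 2z cos(theta) + 1
        s2, s1 = s1, s0
    # one complex operation after the last sample
    return cmath.exp(1j * theta) * s1 - s2

def dft_bin(x, k):
    """Reference value from the DFT definition."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

x = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0, 4.0]
errors = [abs(goertzel_second_order(x, k) - dft_bin(x, k)) for k in range(7)]
print(max(errors))
```

For real input data the recursion runs entirely in real arithmetic, which is what makes this form attractive for detecting a few tones.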



6.1.4 The Bluestein Chirp-z Transform

In the Bluestein chirp-z transform (CZT) algorithm, the DFT exponent nk is quadratically expanded to

$$nk = -(k - n)^2/2 + n^2/2 + k^2/2. \qquad (6.7)$$

The DFT therefore becomes

$$X[k] = W_N^{k^2/2} \sum_{n=0}^{N-1} \left(x[n]\,W_N^{n^2/2}\right) W_N^{-(k-n)^2/2}. \qquad (6.8)$$

This algorithm is graphically interpreted in Fig. 6.6. This results in the following

Algorithm 6.4: Bluestein Chirp-z Algorithm

The computation of the DFT is done in three steps, namely

1) N multiplications of x[n] with $W_N^{n^2/2}$.

2) Linear convolution of $x[n]W_N^{n^2/2} * W_N^{-n^2/2}$.

3) N multiplications with $W_N^{k^2/2}$.

For a complete transform, we therefore need a length N convolution and 2N complex multiplications. The advantage, compared with the Rader algorithms, is that there is no restriction to primes in the transform length N. The CZT can be defined for every length.
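
The three steps of Algorithm 6.4 can be illustrated with a short Python sketch. The function name `bluestein_dft` and the test sequences are assumptions; the convolution sum of (6.8) is written out directly rather than mapped to an FIR filter. The last lines count the distinct chirp exponents $n^2 \bmod 2N$ for N = 8, which gives the number of different complex filter coefficients.

```python
import cmath

def bluestein_dft(x):
    """DFT of arbitrary length N via the chirp-z expansion (6.8)."""
    N = len(x)
    # W_N^(n^2/2) = exp(-j*pi*n^2/N)
    w = [cmath.exp(-1j * cmath.pi * n * n / N) for n in range(N)]
    a = [x[n] * w[n] for n in range(N)]              # step 1: premultiply
    X = []
    for k in range(N):
        # step 2: convolution with the chirp W_N^(-(k-n)^2/2)
        acc = sum(a[n] * cmath.exp(1j * cmath.pi * (k - n) ** 2 / N)
                  for n in range(N))
        X.append(w[k] * acc)                         # step 3: postmultiply
    return X

def dft(x):
    """Reference DFT from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

x7 = [1.0, -2.0, 3.5, 0.0, 4.0, 1.5, -1.0]           # prime length
x8 = [float(n * n % 5) for n in range(8)]            # power-of-two length
err7 = max(abs(a - b) for a, b in zip(bluestein_dft(x7), dft(x7)))
err8 = max(abs(a - b) for a, b in zip(bluestein_dft(x8), dft(x8)))

N = 8
distinct = len({(n * n) % (2 * N) for n in range(1, 2 * N - 1)})
print(err7, err8, distinct)
```

The same code works for every length, which is the advantage over the Rader algorithm noted above.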

Narasimha et al. [131] and others have noticed that in the CZT algorithm many coefficients of the FIR filter part are trivial or identical. For instance, the length-8 CZT has an FIR filter of length 16, but there are only four different complex coefficients as graphically interpreted in Fig. 6.7. These four coefficients are 1, j, and $\pm e^{j22.5^\circ}$, i.e., we have only two nontrivial real coefficients to implement.

It may be of general interest what the maximum DFT length for a fixed number $C_N$ of (complex) coefficients is. This is shown in the following table.

| DFT length | 8 | 12 | 16 | 24 | 40 | 48 | 72 | 80 | 120 | 144 | 168 | 180 | 240 | 360 | 504 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $C_N$ | 4 | 6 | 7 | 8 | 12 | 14 | 16 | 21 | 24 | 28 | 32 | 36 | 42 | 48 | 64 |



As mentioned before, the number of different complex coefficients does not directly correspond to the implementation effort, because some coefficients may be trivial (i.e., ±1 or ±j) or may show symmetry. In particular, the power-of-two length transform enjoys many symmetries, as can be seen from Fig. 6.8. If we compute the maximum DFT length for a specific number of nontrivial real coefficients, we find as maximum length transforms:

| DFT length | 10 | 16 | 20 | 32 | 40 | 48 | 50 | 80 | 96 | 160 | 192 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sin/cos | 2 | 3 | 5 | 6 | 8 | 9 | 10 | 11 | 14 | 20 | 25 |

Length 16 and 32 are therefore the maximum length DFTs with only 3 and 6 real multipliers, respectively!

Fig. 6.6. The Bluestein chirp-z algorithm: premultiplication of x[n] with the chirp signal $\exp(-j\pi n^2/N)$, linear convolution, and postmultiplication with the chirp signal $\exp(-j\pi k^2/N)$, giving X[k].

Fig. 6.7. CZT coefficients $C(n) = e^{j2\pi\,(n^2/2 \bmod 8)/8}$; n = 1, 2, ..., 14 (CZT with 8 points, plotted in the complex plane).

In general, power-of-two lengths are popular FFT building blocks, and the following table therefore shows, for length $N = 2^n$, the effort when implementing the CZT filter in transposed form.



| DFT length | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|
| $C_N$ | 4 | 7 | 12 | 23 | 44 | 87 | 172 | 343 |
| sin/cos | 2 | 3 | 6 | 11 | 22 | 43 | 86 | 171 |
| CSD adders | 7 | 13 | 24 | 48 | 90 | 188 | 355 | 741 |
| MAG adders | 7 | 10 | 21 | 40 | 77 | 149 | 295 | 586 |
| RAG adders | 7 | 11 | 13 | 23 | 41 | 63 | 95 | 169 |

Fig. 6.8. Number of complex coefficients and nontrivial real multiplications for the CZT.



The first row shows the DFT length N. The second row shows the total number of complex exponentials $C_N$. The worst-case effort for $C_N$ complex coefficients is that $2C_N$ real, nontrivial coefficients must be implemented. The actual number of different nontrivial real coefficients is shown in row three. We note, when comparing rows two and three, that for power-of-two lengths the symmetry and trivial coefficients reduce the number of nontrivial coefficients. The last three rows show, for CZT DFTs up to length 1024, the effort (i.e., number of adders) for a 16-bit (15-bit unsigned plus sign bit) coefficient precision implementation, using CSD, MAG, or RAG algorithms (discussed in Chap. 2), respectively. We note that the RAG algorithm, when compared with CSD, can substantially reduce the effort for DFT lengths larger than 32.



6.1.5 The Rader Algorithm 

The Rader algorithm [132, 133] to compute a DFT,

$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{nk} \qquad k, n \in \mathbb{Z}_N; \quad \operatorname{ord}(W_N) = N \qquad (6.9)$$

is defined only for prime length N. We first compute the DC component with

$$X[0] = \sum_{n=0}^{N-1} x[n]. \qquad (6.10)$$

Because N = p is a prime, we know from the discussion in Chap. 2 (p. 43) that there is a primitive element, a generator g, that generates all elements of n and k in the field $\mathbb{Z}_p$, excluding zero, i.e., $g^k \in \mathbb{Z}_p \setminus \{0\}$. We substitute n with $g^n \bmod N$ and k with $g^k \bmod N$, and get the following index transform:

$$X[g^k \bmod N] - x[0] = \sum_{n=0}^{N-2} x[g^n \bmod N]\,W_N^{g^{(n+k) \bmod (N-1)}} \qquad (6.11)$$

for $k \in \{1, 2, 3, \ldots, N-1\}$. We note that the right side of (6.11) is a cyclic convolution, i.e.,

$$\left[x[g^0 \bmod N], x[g^1 \bmod N], \ldots, x[g^{N-2} \bmod N]\right] \circledast \left[W_N, W_N^{g}, \ldots, W_N^{g^{(N-2) \bmod (N-1)}}\right]. \qquad (6.12)$$

An example with N = 7 demonstrates the Rader algorithms. 



Example 6.5: Rader Algorithms for N = 7

For N = 7, we know that g = 3 is a primitive element (see, for instance, [5], Table B.7), and the index transform is

$$[3^0, 3^1, 3^2, 3^3, 3^4, 3^5] \bmod 7 = [1, 3, 2, 6, 4, 5]. \qquad (6.13)$$

We first compute the DC component

$$X[0] = \sum_{n=0}^{6} x[n] = x[0] + x[1] + x[2] + x[3] + x[4] + x[5] + x[6],$$

and in the second step, the cyclic convolution for X[k] − x[0]

$$[x[1], x[3], x[2], x[6], x[4], x[5]] \circledast [W_7, W_7^3, W_7^2, W_7^6, W_7^4, W_7^5],$$

or in matrix notation

$$\begin{bmatrix} X[1] \\ X[3] \\ X[2] \\ X[6] \\ X[4] \\ X[5] \end{bmatrix} =
\begin{bmatrix}
W_7^1 & W_7^3 & W_7^2 & W_7^6 & W_7^4 & W_7^5 \\
W_7^3 & W_7^2 & W_7^6 & W_7^4 & W_7^5 & W_7^1 \\
W_7^2 & W_7^6 & W_7^4 & W_7^5 & W_7^1 & W_7^3 \\
W_7^6 & W_7^4 & W_7^5 & W_7^1 & W_7^3 & W_7^2 \\
W_7^4 & W_7^5 & W_7^1 & W_7^3 & W_7^2 & W_7^6 \\
W_7^5 & W_7^1 & W_7^3 & W_7^2 & W_7^6 & W_7^4
\end{bmatrix}
\begin{bmatrix} x[1] \\ x[3] \\ x[2] \\ x[6] \\ x[4] \\ x[5] \end{bmatrix} +
\begin{bmatrix} x[0] \\ x[0] \\ x[0] \\ x[0] \\ x[0] \\ x[0] \end{bmatrix}. \qquad (6.14)$$

This is graphically interpreted using an FIR filter in Fig. 6.9.

Fig. 6.9. Length p = 7 Rader prime factor DFT implementation.

We now verify the p = 7 Rader DFT formula, using a test triangular signal $x[n] = 10\lambda[n]$ (i.e., a triangle with step size 10, x[n] = 10, 20, ..., 70). Directly interpreting (6.14), with the input vector ordered as [x[1], x[3], x[2], x[6], x[4], x[5]], one obtains

$$\begin{bmatrix} X[1] \\ X[3] \\ X[2] \\ X[6] \\ X[4] \\ X[5] \end{bmatrix} =
\begin{bmatrix}
W_7^1 & W_7^3 & W_7^2 & W_7^6 & W_7^4 & W_7^5 \\
W_7^3 & W_7^2 & W_7^6 & W_7^4 & W_7^5 & W_7^1 \\
W_7^2 & W_7^6 & W_7^4 & W_7^5 & W_7^1 & W_7^3 \\
W_7^6 & W_7^4 & W_7^5 & W_7^1 & W_7^3 & W_7^2 \\
W_7^4 & W_7^5 & W_7^1 & W_7^3 & W_7^2 & W_7^6 \\
W_7^5 & W_7^1 & W_7^3 & W_7^2 & W_7^6 & W_7^4
\end{bmatrix}
\begin{bmatrix} 20 \\ 40 \\ 30 \\ 70 \\ 50 \\ 60 \end{bmatrix} +
\begin{bmatrix} 10 \\ 10 \\ 10 \\ 10 \\ 10 \\ 10 \end{bmatrix} =
\begin{bmatrix} -35 + j72 \\ -35 + j8 \\ -35 + j28 \\ -35 - j72 \\ -35 - j8 \\ -35 - j28 \end{bmatrix}.$$

The value of X[0] is the sum of the time series, which is 10 + 20 + ... + 70 = 280.
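
The index transform and the cyclic convolution of this example can be reproduced with a few lines of Python. The function name `rader_dft` is an assumption for this sketch, and the double loop evaluates the convolution sum (6.11) directly instead of using an FIR filter structure.

```python
import cmath

def rader_dft(x, g):
    """Prime-length DFT via the Rader index transform with generator g."""
    N = len(x)
    perm = [pow(g, n, N) for n in range(N - 1)]   # g^n mod N
    W = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    X[0] = complex(sum(x))                        # DC component (6.10)
    for k in range(N - 1):
        # cyclic convolution (6.11): exponent index (n + k) mod (N - 1)
        acc = sum(x[perm[n]] * W ** perm[(n + k) % (N - 1)]
                  for n in range(N - 1))
        X[perm[k]] = x[0] + acc
    return X, perm

def dft(x):
    """Reference DFT from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

x = [10.0 * (n + 1) for n in range(7)]            # triangle 10, 20, ..., 70
X, perm = rader_dft(x, g=3)
print(perm)                                       # index transform (6.13)
print([complex(round(v.real), round(v.imag)) for v in X])
```

The rounded outputs can be compared with the result vector of the example; the unrounded imaginary parts are about 72.7, 27.9, and 8.0 in magnitude.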



In addition, in the Rader algorithms we may use the symmetries of the complex pairs $e^{\pm j2k\pi/N}$, $k \in [0, N/2]$, to build more efficient FIR realizations (Exercise 6.5, p. 284). Implementing a Rader prime factor DFT is equivalent to implementing an FIR filter, which we discussed in Chap. 3. In order to implement a fast FIR filter, a fully pipelined DA or the transposed filter structure, using the RAG algorithms, is attractive. The RAG FPGA implementation is illustrated in the following example.

Example 6.6: Rader FPGA Implementation 

A RAG implementation of the length-7 Rader algorithm is accomplished as follows. The first step is quantizing the coefficients. Assuming that the input values and coefficients are to be represented as a signed 8-bit word, the quantized coefficients are:

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Re{256 × $W_7^k$} | 256 | 160 | -57 | -231 | -231 | -57 | 160 |
| Im{256 × $W_7^k$} | 0 | -200 | -250 | -111 | 111 | 250 | 200 |
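
The quantized values can be reproduced with a short Python check, assuming that the quantization is plain rounding of $256 \cdot W_7^k$ with $W_7 = e^{-j2\pi/7}$ (the rounding rule is an assumption; the values it produces agree with the coefficients listed above).

```python
import math

# Real and imaginary parts of 256 * W_7^k, with W_7 = exp(-j*2*pi/7)
re_coeffs = [round(256 * math.cos(2 * math.pi * k / 7)) for k in range(7)]
im_coeffs = [round(-256 * math.sin(2 * math.pi * k / 7)) for k in range(7)]

# Magnitudes that actually have to be built as constant multipliers
magnitudes = sorted({abs(c) for c in re_coeffs + im_coeffs} - {0})

print(re_coeffs)
print(im_coeffs)
print(magnitudes)
```

Apart from 256, which is a pure shift, only six different magnitudes remain, which is why sharing the multiplier hardware between coefficients that differ only in sign is worthwhile.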

A direct form implementation of all the individual coefficients would (consulting Table 2.3, p. 40) consume 24 adders for the constant coefficient multipliers. Using the transposed structure, the individual coefficient implementation effort is reduced to 11 adders, by exploiting the fact that several coefficients differ only in sign. Optimizing further (reduced adder graph, see Fig. 2.3, p. 39), the number of adders reaches a minimum value of 7 (see Factor: PROCESS and Coeffs: PROCESS below). This is more than a three times improvement over the direct FIR architecture. The following VHDL code (footnote 1) illustrates a possible implementation of the length-7 Rader DFT, using transposed FIR filters.

Footnote 1: The equivalent Verilog code rader7.v for this example can be found in Appendix A on page 472.

PACKAGE B_bit_int IS                  -- User defined types
  SUBTYPE WORD8  IS INTEGER RANGE -2**7  TO 2**7-1;
  SUBTYPE WORD11 IS INTEGER RANGE -2**10 TO 2**10-1;
  SUBTYPE WORD19 IS INTEGER RANGE -2**18 TO 2**18-1;
  TYPE ARRAY_WORD IS ARRAY (0 to 5) OF WORD19;
END B_bit_int;

LIBRARY work;
USE work.B_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;

ENTITY rader7 IS                             -- Interface
  PORT ( clk            : IN  STD_LOGIC;
         x_in           : IN  WORD8;
         y_real, y_imag : OUT WORD11);
END rader7;

ARCHITECTURE flex OF rader7 IS

  SIGNAL count      : INTEGER RANGE 0 TO 15;
  TYPE   STATE_TYPE IS (Start, Load, Run);
  SIGNAL state      : STATE_TYPE;
  SIGNAL accu       : WORD11;            -- Signal for X[0]
  SIGNAL real, imag : ARRAY_WORD;
                               -- Tapped delay line array
  SIGNAL x57, x111, x160, x200, x231, x250 : WORD19;
                       -- The (unsigned) filter coefficients
  SIGNAL x5, x25, x110, x125, x256 : WORD19;
                            -- Auxiliary filter coefficients
  SIGNAL x, x_0     : WORD8;           -- Signals for x[0]

BEGIN

  States: PROCESS      -- State machine for RADER filter
  BEGIN
    WAIT UNTIL clk = '1';
    CASE state IS
      WHEN Start =>              -- Initialization step
        state <= Load;
        count <= 1;
        x_0 <= x_in;             -- Save x[0]
        accu <= 0;               -- Reset accumulator for X[0]
        y_real <= 0;
        y_imag <= 0;
      WHEN Load =>  -- Apply x[5],x[4],x[6],x[2],x[3],x[1]
        IF count = 8 THEN        -- Load phase done ?
          state <= Run;
        ELSE
          state <= Load;
          accu <= accu + x;
        END IF;
        count <= count + 1;
      WHEN Run =>   -- Apply again x[5],x[4],x[6],x[2],x[3]
        IF count = 15 THEN       -- Run phase done ?
          y_real <= accu;        -- X[0]
          y_imag <= 0;     -- Only re inputs i.e. Im(X[0])=0
          state <= Start;        -- Output of result
        ELSE                     -- and start again
          y_real <= real(0) / 256 + x_0;
          y_imag <= imag(0) / 256;
          state <= Run;
        END IF;
        count <= count + 1;
    END CASE;
  END PROCESS States;

  Structure: PROCESS       -- Structure of the two FIR
  BEGIN                    -- filters in transposed form
    WAIT UNTIL clk = '1';
    x <= x_in;
    -- Real part of FIR filter in transposed form
    real(0) <= real(1) + x160;   -- W^1
    real(1) <= real(2) - x231;   -- W^3
    real(2) <= real(3) - x57;    -- W^2
    real(3) <= real(4) + x160;   -- W^6
    real(4) <= real(5) - x231;   -- W^4
    real(5) <= -x57;             -- W^5
    -- Imaginary part of FIR filter in transposed form
    imag(0) <= imag(1) - x200;   -- W^1
    imag(1) <= imag(2) - x111;   -- W^3
    imag(2) <= imag(3) - x250;   -- W^2
    imag(3) <= imag(4) + x200;   -- W^6
    imag(4) <= imag(5) + x111;   -- W^4
    imag(5) <= x250;             -- W^5
  END PROCESS Structure;

  Coeffs: PROCESS          -- Note that all signals
  BEGIN                    -- are globally defined
    WAIT UNTIL clk = '1';
    -- Compute the filter coefficients and use FFs
    x160 <= x5 * 32;
    x200 <= x25 * 8;
    x250 <= x125 * 2;
    x57  <= x25 + x * 32;
    x111 <= x110 + x;
    x231 <= x256 - x25;
  END PROCESS Coeffs;

  Factors: PROCESS (x, x5, x25)  -- Note that all signals
  BEGIN                          -- are globally defined
    -- Compute the auxiliary factor for RAG without an FF
    x5   <= x * 4 + x;
    x25  <= x5 * 4 + x5;
    x110 <= x25 * 4 + x5 * 2;
    x125 <= x25 * 4 + x25;
    x256 <= x * 256;
  END PROCESS Factors;

END flex;

Fig. 6.10. VHDL simulation of a 7-point Rader algorithm.

The design consists of four blocks of statements within the four PROCESS
statements. The first, "States: PROCESS", is the state machine, which
distinguishes the three processing phases, Start, Load, and Run. The second,
"Structure: PROCESS", defines the two FIR filter paths, real and imaginary,
respectively. The third, "Coeffs: PROCESS", implements the multiplier block
using the reduced adder graph. The fourth block, "Factors: PROCESS",
implements the unregistered factors of the RAG algorithm. It can be seen that
all coefficients are realized by using six adders and one subtractor. The
design consumes 486 LCs, and runs at 23.04 MHz Registered Performance.
Figure 6.10 displays simulation results using MaxPlusII for a triangle input
sequence x[n] = {10, 20, 30, 40, 50, 60, 70}. Note that the input and output
sequences, starting at 950 ns, occur in the permuted order, and negative
results appear as unsigned positive numbers. Finally, at 1.55 µs, X[0] is
forwarded to the output and rader7 is ready to process the next input frame.
| 6.6 |
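The adder count of the multiplier block can be made concrete with a small integer model. The following sketch mirrors the assignments in the Factors and Coeffs processes (ignoring the registers); the function name `rag_multiplier_block` is introduced here only for illustration. Multiplications by powers of two are written as shifts, so every `+` or `-` corresponds to one adder or subtractor.

```python
def rag_multiplier_block(x):
    """Integer model of the reduced adder graph: 6 adders and 1 subtractor."""
    # Unregistered auxiliary factors (Factors process)
    x5 = (x << 2) + x                # adder 1:   5x = 4x + x
    x25 = (x5 << 2) + x5             # adder 2:  25x = 4*5x + 5x
    x110 = (x25 << 2) + (x5 << 1)    # adder 3: 110x = 100x + 10x
    x125 = (x25 << 2) + x25          # adder 4: 125x = 100x + 25x
    x256 = x << 8                    # shift only
    # Registered coefficients (Coeffs process)
    return {
        160: x5 << 5,                # shift only
        200: x25 << 3,               # shift only
        250: x125 << 1,              # shift only
        57: x25 + (x << 5),          # adder 5:  57x = 25x + 32x
        111: x110 + x,               # adder 6: 111x = 110x + x
        231: x256 - x25,             # subtractor: 231x = 256x - 25x
    }

# Every product equals coefficient * x over the full signed 8-bit input range.
ok = all(v == c * x
         for x in range(-128, 128)
         for c, v in rag_multiplier_block(x).items())
print(ok)
```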



Because the Rader algorithm is restricted to prime lengths there is less
symmetry in the coefficients, compared with the CZT. The following table
shows, for prime lengths $2^n \pm 1$, the implementation effort of the circular
filter in transposed form.

DFT length / #Coeffs      7     17     31     61    127    257
sin/cos                   6     16     30     60    124    253
CSD                      28     76    126    265    553   1075
MAG                      24     57     98    219    443    873
RAG                      14     36     45     73    131    253


The first row shows the cyclic convolution length N, which is also the number
of complex coefficients. Comparing row two and the worst case with 2N real
sin/cos coefficients, we see that symmetry and trivial coefficients reduce the
number of nontrivial coefficients by a factor of 2. The last three rows show the
effort for a 16-bit coefficient precision implementation using CSD, MAG, or
RAG algorithms, respectively. Note the advantage of RAG for longer filters.
It can be seen from the above table that the effort for CSD-type filters can
be estimated with BN/4, where B is the coefficient bit width (16 in this
table) and N is the filter length. For RAG, the effort (i.e., number of adders)
is only N, i.e., a factor B/4 improvement over CSD for longer filters (for
B = 16, a factor 16/4 = 4 of improvement). For longer filters, RAG needs
only one additional adder for each additional coefficient, because the
already-synthesized coefficients produce a "dense" grid of small coefficients.
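The two rules of thumb can be compared directly with the table entries. The sketch below assumes that the filter length N equals the DFT length minus one (the cyclic convolution length); that reading, and the variable names, are assumptions made here.

```python
B = 16  # coefficient bit width used in the table

# DFT length -> (CSD adders, RAG adders), copied from the table above
table = {7: (28, 14), 17: (76, 36), 31: (126, 45),
         61: (265, 73), 127: (553, 131), 257: (1075, 253)}

rows = []
for p, (csd, rag) in table.items():
    N = p - 1                 # assumed filter length (cyclic convolution length)
    csd_estimate = B * N / 4  # rule of thumb for CSD
    rag_estimate = N          # rule of thumb for RAG (long filters)
    rows.append((p, csd, csd_estimate, rag, rag_estimate, csd / rag))

for p, csd, csd_est, rag, rag_est, ratio in rows:
    print(f"p={p:3d}  CSD={csd:4d} (est. {csd_est:6.0f})  "
          f"RAG={rag:3d} (est. {rag_est:3d})  CSD/RAG={ratio:.2f}")
```

For the longest filter the ratio CSD/RAG approaches B/4 = 4, while for the short length-7 filter RAG saves only a factor of two.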

6.1.6 The Winograd DFT Algorithm 

The first algorithm with a reduced number of necessary multiplications we 
wish to discuss is the Winograd DFT algorithm. The Winograd algorithm 
is a combination of the Rader algorithm (which translates the DFT into a 
cyclic convolution), and Winograd’s [85] short convolution algorithms, which 
we have already used to implement fast-running FIR filters (see Sect. 5.2.2, 
p. 184). 

The length is therefore restricted to primes or powers of primes. Table 6.2 
gives an overview of the necessary number of arithmetic operations. 

The following example for N = 5 demonstrates the steps to build a Wino- 
grad DFT algorithm. 

Example 6.7: N = 5 Winograd DFT Algorithm

An alternative representation of the Rader algorithm, using X[0] instead of
x[0], is given by [5]

$$X[0] = \sum_{n=0}^{4} x[n] = x[0] + x[1] + x[2] + x[3] + x[4]$$

$$X[k] - X[0] = [x[1], x[2], x[4], x[3]] \circledast [W_5 - 1, W_5^2 - 1, W_5^4 - 1, W_5^3 - 1], \qquad k = 1, 2, 3, 4.$$

If we implement the cyclic convolution of length 4 with a Winograd algorithm
that costs only five nontrivial multiplications, we get the following algorithm:

$$X[k] = \sum_{n=0}^{4} x[n] e^{-j2\pi kn/5}, \qquad k = 0, 1, \ldots, 4$$

$$\begin{bmatrix} X[0] \\ X[1] \\ X[2] \\ X[3] \\ X[4] \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & -1 \\
1 & 1 & -1 & 1 & 1 & 0 \\
1 & 1 & -1 & -1 & -1 & 0 \\
1 & 1 & 1 & -1 & 0 & 1
\end{bmatrix}
\times \mathrm{diag}\Bigl(1,\; \tfrac{1}{2}(\cos(2\pi/5) + \cos(4\pi/5)) - 1,\; \tfrac{1}{2}(\cos(2\pi/5) - \cos(4\pi/5)),\; j\sin(2\pi/5),\; j(-\sin(2\pi/5) + \sin(4\pi/5)),\; j(\sin(2\pi/5) + \sin(4\pi/5))\Bigr)
\times
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & -1 & -1 & 1 \\
0 & 1 & -1 & 1 & -1 \\
0 & 1 & 0 & 0 & -1 \\
0 & 0 & -1 & 1 & 0
\end{bmatrix}
\begin{bmatrix} x[0] \\ x[1] \\ x[2] \\ x[3] \\ x[4] \end{bmatrix}$$

The total computational effort is therefore only 5 or 10 real multiplications
for real or complex input sequences x[n], respectively. | 6.7 |

Table 6.2. Effort for the Winograd DFT with real inputs. Trivial multiplications
are those by ±1 or ±j. For complex inputs, the number of operations is twice as
large.

Block     Total number of         Total number of             Total number of
length    real multiplications    nontrivial multiplications  real additions
   2              2                        0                          2
   3              3                        2                          6
   4              4                        0                          8
   5              6                        5                         17
   7              9                        8                         36
   8              8                        2                         26
   9             11                       10                         44
  11             21                       20                         84
  13             21                       20                         94
  16             18                       10                         74
  17             36                       35                        157
  19             39                       38                        186

It is quite convenient to use a matrix notation for the Winograd DFT
algorithm, and so we get

$$W_{N_l} = C_l \times B_l \times A_l, \qquad (6.15)$$

where $A_l$ incorporates the input additions, $B_l$ is the diagonal matrix with
the Fourier coefficients, and $C_l$ includes the output additions. The only
disadvantage is that now it is not as easy to define the exact steps of the
short convolution algorithms, because the sequence in which input and output
additions are computed is lost with this matrix representation.

This combination of Rader algorithms and a short Winograd convolution,
known as the Winograd DFT algorithm, will be used later, together with
index mapping to introduce the Winograd FFT algorithm. This is the FFT
algorithm with the least number of real multiplications among all known FFT
algorithms.
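The factorization (6.15) is easy to exercise numerically for N = 5. The sketch below transcribes the three factors of Example 6.7 as they are read from the printed matrices (this transcription is an assumption) and applies them in the order input additions, diagonal multiplications, output additions. One observation from this check: with the diagonal entries exactly as printed, the product reproduces the DFT with kernel $e^{+j2\pi kn/5}$, i.e., the complex conjugate of (6.16) for real input; negating the three imaginary diagonal entries gives the kernel $e^{-j2\pi kn/5}$.

```python
import cmath
from math import cos, sin, pi

# Output additions C (5x6) and input additions A (6x5), as read from Example 6.7
C = [[1, 0, 0, 0, 0, 0],
     [1, 1, 1, 1, 0, -1],
     [1, 1, -1, 1, 1, 0],
     [1, 1, -1, -1, -1, 0],
     [1, 1, 1, -1, 0, 1]]
A = [[1, 1, 1, 1, 1],
     [0, 1, 1, 1, 1],
     [0, 1, -1, -1, 1],
     [0, 1, -1, 1, -1],
     [0, 1, 0, 0, -1],
     [0, 0, -1, 1, 0]]
c1, c2, s1, s2 = cos(2 * pi / 5), cos(4 * pi / 5), sin(2 * pi / 5), sin(4 * pi / 5)
# Diagonal of B as printed; B[0] = 1 is trivial, leaving 5 nontrivial multiplications
B = [1, (c1 + c2) / 2 - 1, (c1 - c2) / 2, 1j * s1, 1j * (-s1 + s2), 1j * (s1 + s2)]

def winograd5(x, diag=B):
    a = [sum(A[i][n] * x[n] for n in range(5)) for i in range(6)]     # input additions
    b = [diag[i] * a[i] for i in range(6)]                            # multiplications
    return [sum(C[k][i] * b[i] for i in range(6)) for k in range(5)]  # output additions

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
as_printed = winograd5(x)
negated = winograd5(x, [B[0], B[1], B[2], -B[3], -B[4], -B[5]])
print(as_printed)
print(negated)
```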



6.2 The Fast Fourier Transform (FFT) Algorithms 

As mentioned in the introduction of this chapter, we use the terminology
introduced by Burrus [111], who classified all FFT algorithms simply by
different (multidimensional) index maps of the input and output sequences.
These are based on a transform of the length N DFT (6.2)

$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk} \qquad (6.16)$$

into a multidimensional $N = \prod_l N_l$ representation. It is, in general,
sufficient to discuss only the two-factor case, because higher dimensions can
be built simply by iteratively replacing again one of these factors. To
simplify our representation we will therefore discuss the three FFT algorithms
presented only in terms of a two-dimensional index transform.

We transform the (time) index n with

$$n = A n_1 + B n_2 \bmod N \qquad \begin{cases} 0 \le n_1 \le N_1 - 1 \\ 0 \le n_2 \le N_2 - 1, \end{cases} \qquad (6.17)$$

where $N = N_1 N_2$, and $A, B \in \mathbb{Z}$ are constants that must be defined
later. Using this index transform, a two-dimensional mapping
$f: \mathbb{C}^N \to \mathbb{C}^{N_1 \times N_2}$ of the data is built, according to

$$[x[0]\; x[1]\; x[2] \cdots x[N-1]] =
\begin{bmatrix}
x[0,0] & x[0,1] & \cdots & x[0,N_2-1] \\
x[1,0] & x[1,1] & \cdots & x[1,N_2-1] \\
\vdots & \vdots & \ddots & \vdots \\
x[N_1-1,0] & x[N_1-1,1] & \cdots & x[N_1-1,N_2-1]
\end{bmatrix}. \qquad (6.18)$$

Applying another index mapping k to the output (frequency) domain yields

$$k = C k_1 + D k_2 \bmod N \qquad \begin{cases} 0 \le k_1 \le N_1 - 1 \\ 0 \le k_2 \le N_2 - 1, \end{cases} \qquad (6.19)$$

where $C, D \in \mathbb{Z}$ are constants that must be defined later. Because
the DFT is bijective, we must choose A, B, C, and D in such a way that the
transform representation is still unique, i.e., a bijective mapping. Burrus
[111] has determined the general conditions for how to choose A, B, C, and D
for specific $N_1$ and $N_2$ such that the mapping is bijective (see Exercises
6.7 and 6.8, p. 284). The transforms given in this chapter are all unique.

An important point in distinguishing different FFT algorithms is the
question of whether $N_1$ and $N_2$ are allowed to have a common factor, i.e.,
$\gcd(N_1, N_2) > 1$, or whether the factors must be coprime. Sometimes
algorithms with $\gcd(N_1, N_2) > 1$ are referred to as common factor algorithms
(CFAs), and algorithms with $\gcd(N_1, N_2) = 1$ are called prime factor
algorithms (PFAs). A CFA algorithm discussed in the following is the
Cooley-Tukey FFT, while the Good-Thomas and Winograd FFTs are of the PFA
type. It should be emphasized that the Cooley-Tukey algorithm may indeed
realize FFTs with two factors, $N = N_1 N_2$, which are coprime, and that for
a PFA the factors $N_1$ and $N_2$ must only be coprime, i.e., they need not be
primes themselves. A transform of length N = 12 factored with $N_1 = 4$ and
$N_2 = 3$, for instance, can therefore be used for both CFA FFTs and PFA
FFTs!
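Whether a particular choice of constants gives a unique (bijective) index map can be tested by brute force for small lengths. The helper names below are illustrative only; the sketch enumerates all index pairs of (6.17) and checks that every n in 0, ..., N-1 is hit exactly once.

```python
def index_map(N1, N2, A, B):
    """Two-dimensional table of n = A*n1 + B*n2 mod N, indexed [n1][n2]."""
    N = N1 * N2
    return [[(A * n1 + B * n2) % N for n2 in range(N2)] for n1 in range(N1)]

def is_bijective(N1, N2, A, B):
    """True if the map hits every index 0..N-1 exactly once."""
    flat = [n for row in index_map(N1, N2, A, B) for n in row]
    return sorted(flat) == list(range(N1 * N2))

# N = 12 with N1 = 4, N2 = 3
print(is_bijective(4, 3, A=3, B=1))   # common factor type map n = 3*n1 + n2
print(is_bijective(4, 3, A=3, B=4))   # prime factor type map n = 3*n1 + 4*n2 mod 12
print(is_bijective(4, 3, A=2, B=1))   # a choice that maps two pairs to the same n
```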



6.2.1 The Cooley-Tukey FFT Algorithm 



The Cooley-Tukey FFT is the most universal of all FFT algorithms, because
any factorization of N is possible. The most popular Cooley-Tukey FFTs
are those where the transform length N is a power of a basis r, i.e.,
$N = r^{\nu}$. These algorithms are often referred to as radix-r algorithms.

The index transform suggested by Cooley and Tukey (and earlier by
Gauss) is also the simplest index mapping. Using (6.17) we have $A = N_2$
and $B = 1$, and the following mapping results

$$n = N_2 n_1 + n_2 \qquad \begin{cases} 0 \le n_1 \le N_1 - 1 \\ 0 \le n_2 \le N_2 - 1. \end{cases} \qquad (6.20)$$

From the valid range of $n_1$ and $n_2$, we conclude that the modulo reduction
given by (6.17) need not be explicitly computed.

For the inverse mapping from (6.19), Cooley and Tukey choose $C = 1$ and
$D = N_1$, and the following mapping results

$$k = k_1 + N_1 k_2 \qquad \begin{cases} 0 \le k_1 \le N_1 - 1 \\ 0 \le k_2 \le N_2 - 1. \end{cases} \qquad (6.21)$$



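The modulo computation can also be skipped in this case. If we now
substitute n and k in $W_N^{nk}$ according to (6.20) and (6.21), respectively,
we find

$$W_N^{nk} = W_N^{N_2 n_1 k_1 + N_1 N_2 n_1 k_2 + n_2 k_1 + N_1 n_2 k_2}. \qquad (6.22)$$

Because W is of order $N = N_1 N_2$, it follows that $W_N^{N_1 N_2 n_1 k_2} = 1$,
$W_N^{N_1} = W_{N_2}$, and $W_N^{N_2} = W_{N_1}$. This simplifies (6.22) to

$$W_N^{nk} = W_{N_1}^{n_1 k_1} W_N^{n_2 k_1} W_{N_2}^{n_2 k_2}. \qquad (6.23)$$

If we now substitute (6.23) in the DFT from (6.16) it follows that

$$X[k_1, k_2] = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \Bigl( W_N^{n_2 k_1} \underbrace{\sum_{n_1=0}^{N_1-1} x[n_1, n_2] W_{N_1}^{n_1 k_1}}_{N_1\text{-point transform}} \Bigr) \qquad (6.24)$$

$$= \underbrace{\sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \bar{x}[n_2, k_1]}_{N_2\text{-point transform}}, \qquad (6.25)$$

where $\bar{x}[n_2, k_1]$ denotes the term in parentheses in (6.24), i.e., the
output of the $N_1$-point transforms after multiplication by the twiddle factors.

We now can define the complete Cooley-Tukey algorithm.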

Algorithm 6.8: Cooley-Tukey Algorithm

An $N = N_1 N_2$-point DFT can be done using the following steps:

1) Compute an index transform of the input sequence according to (6.20).

2) Compute the $N_2$ DFTs of length $N_1$.

3) Apply the twiddle factors $W_N^{n_2 k_1}$ to the output of the first
transform stage.

4) Compute $N_1$ DFTs of length $N_2$.

5) Compute an index transform of the output sequence according to (6.21).

The following length-12 transform demonstrates these steps.



Example 6.9: Cooley-Tukey FFT for N = 12

Assume $N_1 = 4$ and $N_2 = 3$. It follows then that $n = 3n_1 + n_2$ and
$k = k_1 + 4k_2$, and we can compute the following tables for the index
mappings:

              n1                                     k1
n2     0      1      2      3         k2     0      1      2      3
0    x[0]   x[3]   x[6]   x[9]        0    X[0]   X[1]   X[2]   X[3]
1    x[1]   x[4]   x[7]   x[10]       1    X[4]   X[5]   X[6]   X[7]
2    x[2]   x[5]   x[8]   x[11]       2    X[8]   X[9]   X[10]  X[11]

With the help of this transform we can construct the signal flow graph shown
in Fig. 6.11. It can be seen that first we must compute three DFTs with
four points each, followed by the multiplication with the twiddle factors, and
finally we compute four DFTs each having length 3. | 6.9 |
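The five steps of Algorithm 6.8 can be traced with a floating-point model for the N = 12 case. This is a sketch for checking the index maps and twiddle factors, not a hardware description; the function names are chosen here, and the short DFTs are computed with the direct sum rather than with optimized length-3 and length-4 modules.

```python
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def cooley_tukey(x, N1, N2):
    N = N1 * N2
    # Steps 1 and 2: input map n = N2*n1 + n2, then N2 DFTs of length N1 (over n1)
    inner = [dft([x[N2 * n1 + n2] for n1 in range(N1)]) for n2 in range(N2)]
    # Step 3: twiddle factors W_N^(n2*k1)
    tw = [[inner[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
           for k1 in range(N1)] for n2 in range(N2)]
    # Steps 4 and 5: N1 DFTs of length N2 (over n2), output map k = k1 + N1*k2
    X = [0j] * N
    for k1 in range(N1):
        col = dft([tw[n2][k1] for n2 in range(N2)])
        for k2 in range(N2):
            X[k1 + N1 * k2] = col[k2]
    return X

x = [complex(n, -n % 5) for n in range(12)]
err = max(abs(a - b) for a, b in zip(cooley_tukey(x, 4, 3), dft(x)))
print(err)
```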



Fig. 6.11. Cooley-Tukey FFT for N = 12.

For direct computation of the 12-point DFT, a total of $12^2 = 144$ complex
multiplications and $11^2 = 121$ complex additions are needed. To compute
the Cooley-Tukey FFT with the same length we need a total of 12 complex
multiplications for the twiddle factors, of which 8 are trivial (i.e., ±1 or
±j) multiplications. According to Table 6.2 (p. 258), the length-4 DFTs can be
computed using 8 real additions (16 for complex input data) and no
multiplications. For the length-3 DFTs with complex input data, we need 4
multiplications and 12 additions. If we implement the (fixed coefficient)
complex multiplications using 3 additions and 3 multiplications (see
Algorithm 6.10, p. 265), the total effort for the 12-point Cooley-Tukey
FFT is given by

3 x 16 + 4 x 3 + 4 x 12 = 108 real additions and
4 x 3 + 4 x 4 = 28 real multiplications.

For the direct implementation we would need $2 \times 11^2 + 12^2 \times 3 = 674$ real
additions and $12^2 \times 3 = 432$ real multiplications. It is now obvious why the
Cooley-Tukey algorithm is called the "fast Fourier transform" (FFT).
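The operation counts quoted above follow from a few lines of arithmetic. In the sketch below the per-module costs are taken from the text (twice the real-input values of Table 6.2 for complex data, and 3 multiplications plus 3 additions per nontrivial twiddle factor); the variable names are illustrative.

```python
N1, N2 = 4, 3
N = N1 * N2

# Twiddle factors W_N^(n2*k1): trivial if the exponent is a multiple of N/4 (values +-1, +-j)
exponents = [(n2 * k1) % N for n2 in range(N2) for k1 in range(N1)]
trivial = sum(1 for e in exponents if e % (N // 4) == 0)
nontrivial = len(exponents) - trivial

adds_len4, mults_len4 = 16, 0    # length-4 DFT, complex input
adds_len3, mults_len3 = 12, 4    # length-3 DFT, complex input
adds_cmul, mults_cmul = 3, 3     # one complex multiplication (Algorithm 6.10)

fft_adds = N2 * adds_len4 + nontrivial * adds_cmul + N1 * adds_len3
fft_mults = N2 * mults_len4 + nontrivial * mults_cmul + N1 * mults_len3

direct_adds = 2 * (N - 1) ** 2 + N ** 2 * 3
direct_mults = N ** 2 * 3

print(trivial, nontrivial, fft_adds, fft_mults, direct_adds, direct_mults)
```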



Radix-r Cooley-Tukey Algorithm 

One important fact that distinguishes the Cooley-Tukey algorithm from
other FFT algorithms is that the factors for N can be chosen arbitrarily.
It is therefore possible to use a radix-r algorithm in which $N = r^S$. The most
popular algorithms are those of basis r = 2 or r = 4, because the necessary
basic DFTs can, according to Table 6.2 (p. 258), be implemented without
any multiplications. For r = 2 and S stages, for instance, the following index
mapping results

$$n = 2^{S-1} n_1 + \cdots + 2 n_{S-1} + n_S \qquad (6.26)$$

$$k = k_1 + 2 k_2 + \cdots + 2^{S-1} k_S. \qquad (6.27)$$

For S > 2 a common practice is that in the signal flow graph a 2-point
DFT is represented with a Butterfly, as shown in Fig. 6.12 for an 8-point
transform. The signal flow graph representation has been simplified by using
the fact that all arriving arrows at a node are added, while the constant
coefficient multiplications are symbolized through a factor at an arrow. A
radix-r algorithm has $\log_r(N)$ stages, and for each group the same type of
twiddle factor occurs.

Fig. 6.12. Decimation-in-frequency algorithm of length 8 for radix-2.

It can be seen from the signal flow graph in Fig. 6.12 that the
computation can be done "in-place," i.e., the memory location used by a
butterfly can be overwritten, because the data are no longer needed in the
next computational steps. The total number of twiddle factor multiplications
for the radix-2 transform is given by

$$\log_2(N)\, N/2, \qquad (6.28)$$

because only every second arrow has a twiddle factor.

Because the algorithm shown in Fig. 6.12 starts in the frequency do- 
main to split the original DFT into shorter DFTs, this algorithm is called 
a decimation-in-frequency (DIF) algorithm. The input values typically occur 
in natural order, while the index of the frequency values is in bit-reversed 
order. Table 6.3 shows the characteristic values of a DIF radix-2 algorithm. 

We may also construct an algorithm with decimation in time (DIT). In
this case, we start by splitting the input (time) sequence, and we find that
all frequency values will appear in natural order (Exercise 6.10, p. 284).

Table 6.3. Radix-2 FFT with frequency decimation.

                            Stage 1   Stage 2   Stage 3   ...   Stage log2(N)
Number of groups               1         2         4      ...       N/2
Butterflies per group         N/2       N/4       N/8     ...        1
Increment exponent
twiddle factors                1         2         4      ...       N/2
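The group and butterfly structure of Table 6.3 maps directly onto three nested loops. The following floating-point sketch (function names chosen here) computes a radix-2 DIF FFT in-place on one list, with the input in natural order and the output left in bit-reversed order, and only reorders the result at the very end.

```python
import cmath

def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT, computed in-place on a copy of x."""
    a = [complex(v) for v in x]
    N = len(a)
    S = N.bit_length() - 1
    assert 1 << S == N, "length must be a power of two"
    half = N // 2       # butterflies per group: N/2, N/4, ..., 1
    step = 1            # increment of the twiddle exponent: 1, 2, ..., N/2
    while half >= 1:
        for start in range(0, N, 2 * half):        # groups: 1, 2, 4, ..., N/2
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = u + v               # overwrite both butterfly inputs
                a[start + j + half] = (u - v) * cmath.exp(-2j * cmath.pi * j * step / N)
        half //= 2
        step *= 2
    # a[m] now holds X[bit_reverse(m)]; return the values in natural order
    return [a[bit_reverse(k, S)] for k in range(N)]

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [1, 2, 3, 4, 0, -1, -2, 5]
print(max(abs(p - q) for p, q in zip(fft_dif_radix2(x), dft(x))))
```

For simplicity the sketch multiplies by the twiddle factor in every butterfly, including the trivial factor $W^0 = 1$, which a hardware implementation would skip.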



The necessary index transform for index 41, for a radix-2 and a radix-4
algorithm, is shown in Fig. 6.13. For a radix-2 algorithm, a reversing of the
bit sequence, a bitreverse, is necessary. For a radix-4 algorithm we must first
build "digits" of two bits, and then reverse the order of these digits. This
operation is called digitreverse.

              Bitreverse (R = 2)        Digitreverse (R = 4)
Original      X[41] = 1 0 1 0 0 1       X[41] = 10 10 01
Reversed      X[37] = 1 0 0 1 0 1       X[26] = 01 10 10

Fig. 6.13. Bitreverse and digitreverse.
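Both reorderings are instances of one digit-reversal routine with a different radix. A minimal sketch (the function name `digit_reverse` is chosen here for illustration):

```python
def digit_reverse(index, radix, digits):
    """Reverse the order of the base-`radix` digits of `index` (fixed digit count)."""
    result = 0
    for _ in range(digits):
        index, d = divmod(index, radix)   # peel off the least significant digit
        result = result * radix + d       # and append it at the other end
    return result

# Index 41 of a length-64 transform: 6 bits for radix 2, 3 digits for radix 4
print(digit_reverse(41, 2, 6), digit_reverse(41, 4, 3))
```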



Radix-2 Cooley-Tukey Algorithm Implementation 

A radix-2 FFT can be efficiently implemented using a butterfly processor 
which includes, besides the butterfly itself, an additional complex multiplier 
for the twiddle factors. 

A radix-2 butterfly processor consists of a complex adder, a complex
subtractor, and a complex multiplier for the twiddle factors. The complex
multiplication with the twiddle factor is often implemented with four real
multiplications and two add/subtract operations. However, it is also possible
to build the complex multiplier with only three real multiplications and three
add/subtract operations, because one operand is precomputed. The algorithm
works as follows:

Algorithm 6.10: Efficient Complex Multiplier

The complex twiddle factor multiplication $R + jI = (X + jY) \times (C + jS)$
can be simplified, because C and S are precomputed and stored in a table.
It is therefore also possible to store the following three coefficients

$$C, \quad C + S, \quad \text{and} \quad C - S. \qquad (6.29)$$

With these three precomputed factors we first compute

$$E = X - Y, \quad \text{and then} \quad Z = C \times E = C \times (X - Y). \qquad (6.30)$$

We can then compute the final product using

$$R = (C - S) \times Y + Z \qquad (6.31)$$

$$I = (C + S) \times X - Z. \qquad (6.32)$$

To check:

$$R = (C - S)Y + C(X - Y) = CY - SY + CX - CY = CX - SY \;\checkmark$$

$$I = (C + S)X - C(X - Y) = CX + SX - CX + CY = CY + SX. \;\checkmark$$

The algorithm uses three multiplications, one addition, and two subtractions,
at the cost of an additional, third table.

The following example demonstrates the implementation of this twiddle 
factor complex multiplier. 

Example 6.11: Twiddle Factor Multiplier

Let us first choose some concrete design parameters for the twiddle factor
multiplier. Let us assume we have 8-bit input data, the coefficients should
have 8 bits (i.e., 7 bits plus sign), and we wish to multiply by
$e^{j\pi/9} = e^{j20^\circ}$. Quantized to 8 bits, the twiddle factor becomes
$C + jS = 128 \times e^{j\pi/9} = 121 + j39$. If we use an input value of
70 + j50, then the expected result is

$$(70 + j50)e^{j\pi/9} = (70 + j50)(121 + j39)/128 = (6520 + j8780)/128 = 50 + j68.$$

If we use Algorithm 6.10 to compute the complex multiplication, the three
factors become:

C = 121, C + S = 160, and C - S = 82.

We note from the above that, in general, the tables C + S and C - S must
have one more bit of precision than the C and S tables.
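The three-multiplier scheme of Algorithm 6.10 can be replayed with the integer values of this example. The sketch below is a behavioral model only; the function name and the `shift` parameter are chosen here, and the final scaling by 128 is modeled as an arithmetic right shift by 7 bits, which is an assumption about how the output is truncated.

```python
def ccmul(x, y, c, cps, cms, shift=7):
    """(x + jy)(c + js) with three multiplications; cps = c + s, cms = c - s."""
    e = x - y              # subtraction 1
    z = c * e              # multiplication 1
    r = cms * y + z        # multiplication 2, addition:    r = c*x - s*y
    i = cps * x - z        # multiplication 3, subtraction: i = c*y + s*x
    return r >> shift, i >> shift

C, S = 121, 39
print(ccmul(70, 50, C, C + S, C - S))            # scaled result
print(ccmul(70, 50, C, C + S, C - S, shift=0))   # unscaled products
```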

The following VHDL code² implements the twiddle factor multiplier.

LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY ccmul IS
  GENERIC (W2 : INTEGER := 17;   -- Multiplier bit width
           W1 : INTEGER := 9;    -- Bit width c+s sum
           W  : INTEGER := 8);   -- Input bit width
  PORT (clk : STD_LOGIC;    -- Clock for the output register
        x_in, y_in, c_in           -- Inputs
              : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);
        cps_in, cms_in             -- Inputs
              : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
        r_out, i_out               -- Results
              : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));
END ccmul;

ARCHITECTURE flex OF ccmul IS

  SIGNAL x, y, c : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
                                     -- Inputs and outputs
  SIGNAL r, i, cmsy, cpsx, xmyc      -- Products
                 : STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
  SIGNAL xmy, cps, cms, sxtx, sxty   -- x-y etc.
                 : STD_LOGIC_VECTOR(W1-1 DOWNTO 0);

BEGIN

  x   <= x_in;      -- x
  y   <= y_in;      -- j * y
  c   <= c_in;      -- cos
  cps <= cps_in;    -- cos + sin
  cms <= cms_in;    -- cos - sin

  PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    r_out <= r(W2-3 DOWNTO W-1);   -- Scaling and FF
    i_out <= i(W2-3 DOWNTO W-1);   -- for output
  END PROCESS;

  -- ccmul with 3 mul. and 3 add/sub
  sxtx <= x(x'high) & x;   -- Possible growth for
  sxty <= y(y'high) & y;   -- sub_1 -> sign extension

  sub_1: lpm_add_sub            -- Sub: x - y
    GENERIC MAP ( LPM_WIDTH => W1, LPM_DIRECTION => "SUB",
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP (dataa => sxtx, datab => sxty, result => xmy);

  mul_1: lpm_mult               -- Multiply (x-y)*c = xmyc
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W,
                  LPM_WIDTHP => W2, LPM_WIDTHS => W2,
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP ( dataa => xmy, datab => c, result => xmyc);

  mul_2: lpm_mult               -- Multiply (c-s)*y = cmsy
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W,
                  LPM_WIDTHP => W2, LPM_WIDTHS => W2,
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP ( dataa => cms, datab => y, result => cmsy);

  mul_3: lpm_mult               -- Multiply (c+s)*x = cpsx
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W,
                  LPM_WIDTHP => W2, LPM_WIDTHS => W2,
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP ( dataa => cps, datab => x, result => cpsx);

  sub_2: lpm_add_sub       -- Sub: i <= (c+s)*x - (x-y)*c
    GENERIC MAP ( LPM_WIDTH => W2, LPM_DIRECTION => "SUB",
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP ( dataa => cpsx, datab => xmyc, result => i);

  add_1: lpm_add_sub       -- Add: r <= (x-y)*c + (c-s)*y
    GENERIC MAP ( LPM_WIDTH => W2, LPM_DIRECTION => "ADD",
                  LPM_REPRESENTATION => "SIGNED")
    PORT MAP ( dataa => cmsy, datab => xmyc, result => r);

END flex;

² The equivalent Verilog code ccmul.v for this example can be found in
Appendix A on page 474.

Fig. 6.14. VHDL simulation of a twiddle factor multiplier.

The twiddle factor multiplier is implemented using component instantiations
of three lpm_mult and three lpm_add_sub modules. The output is scaled such
that it has the same data format as the input. This is reasonable, because
multiplication with a complex exponential $e^{j\phi}$ should not change the
magnitude of the complex input. To ensure short latency (for an in-place FFT),
the complex multiplier only has output registers, with no internal pipeline
registers. With only one output register, it is impossible to determine the
Registered Performance of the design, but from the simulation results in
Fig. 6.14, it can be estimated. The design uses 493 LCs and may run faster,
if the lpm_mult components can be pipelined (see Fig. 2.15, p. 63). | 6.11 |



An in-place implementation, i.e., with only one data memory, is now
possible, because the butterfly processor is designed without pipeline stages.
If we introduce additional pipeline stages (one for the butterfly and three
for the multiplier), the size of the design will increase only insignificantly
(see Exercise 6.23, p. 287), but the speed increases significantly. The price
for this pipeline design is the cost for extra data memory for the whole FFT,
because data read and write memories must now be separated, i.e., no in-place
computation can be done.

Using the twiddle factor multiplier introduced above, it is now possible
to design a butterfly processor for a radix-2 Cooley-Tukey FFT.

Example 6.12: Butterfly Processor

To prevent overflow in the arithmetic, the butterfly processor computes the
two (scaled) butterfly equations

$$D_{re} + j \times D_{im} = ((A_{re} + j \times A_{im}) + (B_{re} + j \times B_{im}))/2$$

$$E_{re} + j \times E_{im} = ((A_{re} + j \times A_{im}) - (B_{re} + j \times B_{im}))/2$$

Then the temporary result $E_{re} + j \times E_{im}$ must be multiplied by the
twiddle factor.

The VHDL code³ of the whole butterfly processor is shown in the following.

LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

PACKAGE mul_package IS         -- User defined components
  COMPONENT ccmul
    GENERIC (W2 : INTEGER := 17;   -- Multiplier bit width
             W1 : INTEGER := 9;    -- Bit width c+s sum
             W  : INTEGER := 8);   -- Input bit width
    PORT
      (clk : IN STD_LOGIC;   -- Clock for the output register
       x_in, y_in, c_in : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);
                                                   -- Inputs
       cps_in, cms_in   : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
                                                   -- Inputs
       r_out, i_out     : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));
                                                   -- Results
  END COMPONENT;
END mul_package;

LIBRARY work;
USE work.mul_package.ALL;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
-- std_logic_signed (instead of std_logic_unsigned) so that
-- CONV_INTEGER interprets the input vectors as two's complement
USE ieee.std_logic_signed.ALL;

ENTITY bfproc IS
  GENERIC (W2 : INTEGER := 17;   -- Multiplier bit width
           W1 : INTEGER := 9;    -- Bit width c+s sum
           W  : INTEGER := 8);   -- Input bit width
  PORT
    (clk : STD_LOGIC;
     Are_in, Aim_in, c_in,               -- 8 bit inputs
     Bre_in, Bim_in : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);
     cps_in, cms_in : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
                                         -- 9 bit coefficients
     Dre_out, Dim_out,                   -- 8 bit results
     Ere_out, Eim_out : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));
END bfproc;

ARCHITECTURE flex OF bfproc IS

  SIGNAL dif_re, dif_im                  -- Bf out
                    : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
  SIGNAL Are, Aim, Bre, Bim : INTEGER RANGE -128 TO 127;
                                         -- Inputs as integers
  SIGNAL c          : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
                                         -- Input
  SIGNAL cps, cms   : STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
                                         -- Coeff in
  SIGNAL Cre, Cim   : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
                                         -- Results

BEGIN

  PROCESS   -- Compute the additions of the butterfly using
  BEGIN     -- integers and store inputs in flip-flops
    WAIT UNTIL clk = '1';
    Are <= CONV_INTEGER(Are_in);
    Aim <= CONV_INTEGER(Aim_in);
    Bre <= CONV_INTEGER(Bre_in);
    Bim <= CONV_INTEGER(Bim_in);
    c   <= c_in;        -- Load from memory cos
    cps <= cps_in;      -- Load from memory cos+sin
    cms <= cms_in;      -- Load from memory cos-sin
    Dre_out <= CONV_STD_LOGIC_VECTOR( (Are + Bre)/2, W);
    Dim_out <= CONV_STD_LOGIC_VECTOR( (Aim + Bim)/2, W);
  END PROCESS;

  -- No FF because butterfly difference "diff" is not an
  PROCESS (Are, Bre, Aim, Bim)          -- output port
  BEGIN
    dif_re <= CONV_STD_LOGIC_VECTOR(Are/2 - Bre/2, 8);
    dif_im <= CONV_STD_LOGIC_VECTOR(Aim/2 - Bim/2, 8);
  END PROCESS;

  -- Instantiate the complex twiddle factor multiplier
  ccmul_1: ccmul                -- Multiply (x+jy)(c+js)
    GENERIC MAP ( W2 => W2, W1 => W1, W => W)
    PORT MAP ( clk => clk, x_in => dif_re, y_in => dif_im,
               c_in => c, cps_in => cps, cms_in => cms,
               r_out => Ere_out, i_out => Eim_out);

END flex;

³ The equivalent Verilog code bfproc.v for this example can be found in
Appendix A on page 476.

Fig. 6.15. VHDL simulation of a radix-2 butterfly processor.

The butterfly processor is implemented using one adder, one subtractor, and
the twiddle factor multiplier instantiated as a component. Flip-flops have
been implemented for the inputs A and B, the three table values, and the
output port D, in order to have a single input/output registered design. The
design uses 531 LCs and runs at 13.56 MHz Registered Performance. Figure 6.15
shows the simulation for the zero-pipeline design, for the inputs
A = 100 + j110, B = -40 + j10, and $W = e^{j\pi/9}$. | 6.12 |
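The data path of the butterfly processor can be followed with an integer model using the input values of this example. The sketch below is a behavioral approximation, not a cycle-accurate simulation: registers are ignored, integer division is modeled as truncation toward zero, the final scaling is modeled as an arithmetic right shift by 7 bits, and the quantized table values C = 121, C + S = 160, C - S = 82 from Example 6.11 are assumed. All function names are illustrative.

```python
def div2(v):
    """Integer division by 2 with truncation toward zero."""
    return int(v / 2)

def bfproc(a, b, c, cps, cms):
    """Scaled radix-2 butterfly followed by the three-multiplier twiddle product."""
    are, aim = a
    bre, bim = b
    d = (div2(are + bre), div2(aim + bim))    # sum path D = (A + B)/2
    x = div2(are) - div2(bre)                 # difference path, real part
    y = div2(aim) - div2(bim)                 # difference path, imaginary part
    z = c * (x - y)
    e = ((cms * y + z) >> 7, (cps * x - z) >> 7)   # E = ((A - B)/2) * W
    return d, e

d, e = bfproc((100, 110), (-40, 10), c=121, cps=160, cms=82)
print("D =", d, " E =", e)
```

With these inputs the difference path produces 70 + j50, which is the operand used in Example 6.11.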



6.2.2 The Good-Thomas FFT Algorithm 

The index transform suggested by Good [134] and Thomas [135] transforms
a DFT of length $N = N_1 N_2$ into an "actual" two-dimensional DFT, i.e.,
there are no twiddle factors as in the Cooley-Tukey FFT. The price we pay
for the twiddle factor free flow is that the factors must be coprime (i.e.,
$\gcd(N_k, N_l) = 1$ for $k \neq l$), and we have a somewhat more complicated
index mapping, as long as the index computation is done "online" and no
precomputed tables are used for the index mapping.

If we try to eliminate the twiddle factors introduced through the index
mapping of n and k according to (6.17) and (6.19), respectively, it follows
from

$$W_N^{nk} = W_N^{(A n_1 + B n_2)(C k_1 + D k_2)} = W_N^{AC n_1 k_1 + AD n_1 k_2 + BC k_1 n_2 + BD n_2 k_2} = W_N^{N_2 n_1 k_1} W_N^{N_1 k_2 n_2} = W_{N_1}^{n_1 k_1} W_{N_2}^{k_2 n_2} \qquad (6.33)$$

that we must fulfill all the following necessary conditions at the same time:

$$\langle AD \rangle_N = \langle BC \rangle_N = 0 \qquad (6.34)$$

$$\langle AC \rangle_N = N_2 \qquad (6.35)$$

$$\langle BD \rangle_N = N_1. \qquad (6.36)$$

The mapping suggested by Good [134] and Thomas [135] fulfills this condition
and is given by

$$A = N_2 \qquad B = N_1 \qquad C = N_2 \langle N_2^{-1} \rangle_{N_1} \qquad D = N_1 \langle N_1^{-1} \rangle_{N_2}. \qquad (6.37)$$

To check: Because the factors AD and BC both include the factor
$N_1 N_2 = N$, it follows that (6.34) is checked. With $\gcd(N_1, N_2) = 1$ and
a theorem due to Euler, we can write the inverse as
$N_2^{-1} \bmod N_1 = N_2^{\phi(N_1)-1} \bmod N_1$, where $\phi$ is the Euler
totient function. The condition (6.35) can now be rewritten as

$$\langle AC \rangle_N = \langle N_2 N_2 \langle N_2^{\phi(N_1)-1} \rangle_{N_1} \rangle_N. \qquad (6.38)$$

We can now solve the inner modulo reduction, and it follows with
$\nu \in \mathbb{Z}$ and $\nu N_1 N_2 \bmod N = 0$ finally

$$\langle AC \rangle_N = \langle N_2 N_2 (N_2^{\phi(N_1)-1} + \nu N_1) \rangle_N = N_2. \qquad (6.39)$$

The same argument can be applied for the condition (6.36), and we have
shown that all three conditions from (6.34)-(6.36) are fulfilled if the
Good-Thomas mapping (6.37) is used. □

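In conclusion, we can now define the following theorem:

Theorem 6.13: Good-Thomas Index Mapping

The index mapping suggested by Good and Thomas for n is

$$n = N_2 n_1 + N_1 n_2 \bmod N \qquad \begin{cases} 0 \le n_1 \le N_1 - 1 \\ 0 \le n_2 \le N_2 - 1 \end{cases} \tag{6.40}$$

and as index mapping for k results

$$k = N_2 \langle N_2^{-1} \rangle_{N_1} k_1 + N_1 \langle N_1^{-1} \rangle_{N_2} k_2 \bmod N \qquad \begin{cases} 0 \le k_1 \le N_1 - 1 \\ 0 \le k_2 \le N_2 - 1 \end{cases} \tag{6.41}$$

The transform from (6.41) is identical to the Chinese remainder theorem
2.2.13 (p. 43). It follows, therefore, that $k_1$ and $k_2$ can simply be computed
via a modulo reduction, i.e., $k_l = k \bmod N_l$.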
If we now substitute the Good-Thomas index map in the equation for the
DFT matrix (6.16), it follows that

$$X[k_1,k_2] = \sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2} \underbrace{\left( \sum_{n_1=0}^{N_1-1} x[n_1,n_2]\, W_{N_1}^{n_1 k_1} \right)}_{N_1\text{-point transform}\;=\;\bar{x}[n_2,k_1]} \tag{6.42}$$

$$= \underbrace{\sum_{n_2=0}^{N_2-1} W_{N_2}^{n_2 k_2}\, \bar{x}[n_2,k_1]}_{N_2\text{-point transform}}, \tag{6.43}$$



i.e., as claimed at the beginning, it is an “actual” two-dimensional DFT 
transform without the twiddle factor introduced by the mapping suggested 
by Cooley and Tukey. It follows that the Good-Thomas algorithm, although 
similar to the Cooley-Tukey Algorithm 6.8, has a different index mapping 
and no twiddle factors. 

Algorithm 6.14: Good-Thomas FFT Algorithm

An $N = N_1 N_2$-point DFT can be computed according to the following
steps:

1) Index transform of the input sequence, according to (6.40).

2) Computation of $N_2$ DFTs of length $N_1$.

3) Computation of $N_1$ DFTs of length $N_2$.

4) Index transform of the output sequence, according to (6.41).

An $N = 12$ transform shown in the following example demonstrates the steps.



Example 6.15: Good-Thomas FFT Algorithm for N = 12

Suppose we have $N_1 = 4$ and $N_2 = 3$. Then a mapping for the input index
according to $n = 3n_1 + 4n_2 \bmod 12$, and $k = 9k_1 + 4k_2 \bmod 12$ for the output
index results, and we can compute the following index mapping tables:

Input index n:

           n_1 = 0   n_1 = 1   n_1 = 2   n_1 = 3
n_2 = 0     x[0]      x[3]      x[6]      x[9]
n_2 = 1     x[4]      x[7]      x[10]     x[1]
n_2 = 2     x[8]      x[11]     x[2]      x[5]

Output index k:

           k_1 = 0   k_1 = 1   k_1 = 2   k_1 = 3
k_2 = 0     X[0]      X[9]      X[6]      X[3]
k_2 = 1     X[4]      X[1]      X[10]     X[7]
k_2 = 2     X[8]      X[5]      X[2]      X[11]

Using these index transforms we can construct the signal flow graph shown 
in Fig. 6.16. We realize that the first stage has three DFTs each having four 
points and the second stage four DFTs each of length 3. Multiplication by 
twiddle factors between the stages is not necessary. | 6.15 |

Fig. 6.16. PFA FFT for N = 12.
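The four steps of Algorithm 6.14 can be sketched in a few lines of Python. This is only an illustrative software model, not a hardware description: the function names `dft` and `good_thomas_fft` are hypothetical, the short DFTs are computed directly instead of with a Rader or Winograd block, and $W_N = e^{-j2\pi/N}$ is assumed.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT with W_N = exp(-j*2*pi/N); stands in for a short DFT block."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_len)
                for n in range(n_len)) for k in range(n_len)]


def good_thomas_fft(x, n1_len, n2_len):
    """DFT of length N = n1_len * n2_len for coprime factors, without twiddle factors."""
    n_len = n1_len * n2_len
    assert len(x) == n_len
    # Output map coefficients from (6.37): C = N2*<N2^-1>_N1 and D = N1*<N1^-1>_N2.
    # pow(a, -1, m) needs Python 3.8+ and raises ValueError if gcd(a, m) != 1.
    c = n2_len * pow(n2_len, -1, n1_len)
    d = n1_len * pow(n1_len, -1, n2_len)
    # Step 1: input index transform (6.40); one row per n2, one column per n1.
    grid = [[x[(n2_len * n1 + n1_len * n2) % n_len] for n1 in range(n1_len)]
            for n2 in range(n2_len)]
    # Step 2: N2 DFTs of length N1 (along each row).
    rows = [dft(row) for row in grid]
    # Step 3: N1 DFTs of length N2 (along each column).
    cols = [dft([rows[n2][k1] for n2 in range(n2_len)]) for k1 in range(n1_len)]
    # Step 4: output index transform (6.41).
    out = [0j] * n_len
    for k1 in range(n1_len):
        for k2 in range(n2_len):
            out[(c * k1 + d * k2) % n_len] = cols[k1][k2]
    return out
```

For $N_1 = 4$ and $N_2 = 3$ the computed coefficients are $C = 9$ and $D = 4$, which reproduces the output table of Example 6.15. A table-based index mapping would replace the modulo computations in a hardware design.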



6.2.3 The Winograd FFT Algorithm 

The Winograd FFT algorithm [85] is based on the observation that the inverse
DFT matrix (6.4) (without prefactor $N^{-1}$) of dimension $N_1 \times N_2$, with
$\gcd(N_1, N_2) = 1$, i.e.,

$$x[n] = \sum_{k=0}^{N-1} X[k]\, W_N^{-nk} \tag{6.44}$$

$$\mathbf{x} = \mathbf{W}_N^{*} \mathbf{X} \tag{6.45}$$

can be rewritten using the Kronecker product (footnote 4) with two quadratic IDFT
matrices each, with dimension $N_1$ and $N_2$, respectively. As with the index
mapping for the Good-Thomas algorithm, we must write the indices of $X[k]$
and $x[n]$ in a two-dimensional scheme and then read out the indices row by
row. The following example for N = 12 demonstrates the steps.

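4 A Kronecker product is defined by

$$\mathbf{A} \otimes \mathbf{B} = \big[a[i,j]\big]\,\mathbf{B} = \begin{bmatrix} a[0,0]\mathbf{B} & \cdots & a[0,L-1]\mathbf{B} \\ \vdots & & \vdots \\ a[K-1,0]\mathbf{B} & \cdots & a[K-1,L-1]\mathbf{B} \end{bmatrix}$$

where A is a K x L matrix.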
Example 6.16: IDFT using Kronecker Product for N = 12

Let $N_1 = 4$ and $N_2 = 3$. Then we have the output index transform
$k = 9k_1 + 4k_2 \bmod 12$ according to the Good-Thomas index mapping:

           k_1 = 0   k_1 = 1   k_1 = 2   k_1 = 3
k_2 = 0     X[0]      X[9]      X[6]      X[3]
k_2 = 1     X[4]      X[1]      X[10]     X[7]
k_2 = 2     X[8]      X[5]      X[2]      X[11]

Reading the table row by row reorders the vector $[X[0], X[1], \ldots, X[11]]^T$.
We can now construct a length-12 IDFT with

$$
\begin{bmatrix} x[0] & x[9] & x[6] & x[3] & x[4] & x[1] & x[10] & x[7] & x[8] & x[5] & x[2] & x[11] \end{bmatrix}^T
= \mathbf{W}_4 \otimes \mathbf{W}_3
\begin{bmatrix} X[0] & X[9] & X[6] & X[3] & X[4] & X[1] & X[10] & X[7] & X[8] & X[5] & X[2] & X[11] \end{bmatrix}^T
$$



So far we have used the Kronecker product to (re)define the IDFT. Using
the short-hand notation $\tilde{\mathbf{x}}$ for the permuted sequence $\mathbf{x}$, we may use the
following matrix/vector notation:

$$\tilde{\mathbf{x}} = \mathbf{W}_{N_1} \otimes \mathbf{W}_{N_2}\, \tilde{\mathbf{X}}. \tag{6.46}$$

For these short DFTs we now use the Winograd DFT Algorithm 6.7 (p. 257),
i.e.,

$$\mathbf{W}_{N_l} = \mathbf{C}_l \times \mathbf{B}_l \times \mathbf{A}_l, \tag{6.47}$$

where $\mathbf{A}_l$ incorporates the input additions, $\mathbf{B}_l$ is a diagonal matrix with the
Fourier coefficients, and $\mathbf{C}_l$ includes the output additions. If we now
substitute (6.47) into (6.46), and use the fact that we can change the sequence of
matrix multiplications and Kronecker product computation (see for instance
[5, App. D]), we get

$$
\begin{aligned}
\mathbf{W}_{N_1} \otimes \mathbf{W}_{N_2} &= (\mathbf{C}_1 \times \mathbf{B}_1 \times \mathbf{A}_1) \otimes (\mathbf{C}_2 \times \mathbf{B}_2 \times \mathbf{A}_2) \\
&= (\mathbf{C}_1 \otimes \mathbf{C}_2)(\mathbf{B}_1 \otimes \mathbf{B}_2)(\mathbf{A}_1 \otimes \mathbf{A}_2).
\end{aligned}
\tag{6.48}
$$

Because the matrices $\mathbf{A}_l$ and $\mathbf{C}_l$ are simple addition matrices, the same
applies for their Kronecker products, $\mathbf{A}_1 \otimes \mathbf{A}_2$ and $\mathbf{C}_1 \otimes \mathbf{C}_2$. The Kronecker
product of two quadratic diagonal matrices of dimension $N_1$ and $N_2$,
respectively, obviously also gives a diagonal matrix of dimension $N_1 N_2$. The total
number of necessary multiplications is therefore identical to the number of
diagonal elements of $\mathbf{B} = \mathbf{B}_1 \otimes \mathbf{B}_2$, i.e., $M_1 M_2$, if $M_1$ and $M_2$, respectively,
are the number of multiplications used to compute the smaller Winograd
DFTs according to Table 6.2 (p. 258).
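The exchange of matrix product and Kronecker product used in (6.48) can be checked numerically. The sketch below uses small integer matrices chosen only for illustration (they are not Winograd matrices), and the helpers `kron` and `matmul` are hypothetical plain-Python routines.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Arbitrary 2x2 and 3x3 factors: "addition" matrices A, C and diagonal B.
a1, b1, c1 = [[1, 1], [1, -1]], [[1, 0], [0, 3]], [[1, 0], [1, 1]]
a2 = [[1, 1, 1], [0, 1, 1], [0, 1, -1]]
b2 = [[1, 0, 0], [0, -2, 0], [0, 0, 5]]
c2 = [[1, 0, 0], [1, 1, 1], [1, 1, -1]]

# Left side of (6.48): Kronecker product of the two factored matrices.
lhs = kron(matmul(matmul(c1, b1), a1), matmul(matmul(c2, b2), a2))
# Right side of (6.48): product of the three Kronecker products.
b12 = kron(b1, b2)
rhs = matmul(matmul(kron(c1, c2), b12), kron(a1, a2))

# B1 (x) B2 stays diagonal, so its size gives the multiplication count.
num_mults = sum(1 for i in range(len(b12)) if b12[i][i] != 0)
is_diagonal = all(b12[i][j] == 0 for i in range(6) for j in range(6) if i != j)
```

Here `lhs == rhs` holds exactly because all entries are integers, and the centralized diagonal matrix has 2 x 3 = 6 entries.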

We can now combine the different steps to construct a Winograd FFT. 

Theorem 6.17: Winograd FFT Design

An $N = N_1 N_2$-point transform with coprimes $N_1$ and $N_2$ can be
constructed as follows:

1) Index transform of the input sequence according to the Good-Thomas
mapping (6.40), followed by a row read of the indices.

2) Factorization of the DFT matrix using the Kronecker product.

3) Substitute the length $N_1$ and $N_2$ DFT matrices through the Winograd
DFT algorithm.

4) Centralize the multiplications.

After successful construction of the Winograd FFT algorithm, we can
compute the Winograd FFT using the following three steps:

Theorem 6.18: Winograd FFT Algorithm

1) Compute the preadditions $\mathbf{A}_1$ and $\mathbf{A}_2$.

2) Compute $M_1 M_2$ multiplications according to the matrix $\mathbf{B}_1 \otimes \mathbf{B}_2$.

3) Compute postadditions according to $\mathbf{C}_1$ and $\mathbf{C}_2$.

Let us now look at the construction of a Winograd FFT of length 12 in detail
in the following example.



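Example 6.19: Winograd FFT of Length 12

To build a Winograd FFT, we have, according to Theorem 6.17, to compute
the necessary matrices used in the transform. For $N_1 = 3$ and $N_2 = 4$ we
have the following matrices:

$$
\mathbf{A}_1 \otimes \mathbf{A}_2 =
\begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 1 & -1 \end{bmatrix}
\otimes
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\tag{6.49}
$$

$$\mathbf{B}_1 \otimes \mathbf{B}_2 = \mathrm{diag}(1, -3/2, \sqrt{3}/2) \otimes \mathrm{diag}(1, 1, 1, -j) \tag{6.50}$$

$$
\mathbf{C}_1 \otimes \mathbf{C}_2 =
\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & j \\ 1 & 1 & -j \end{bmatrix}
\otimes
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\tag{6.51}
$$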



Combining these matrices according to (6.48) results in the Winograd FFT
algorithm. Input and output additions can be realized multiplier free, and
the total number of real multiplications becomes 2 x 3 x 4 = 24. | 6.19 |
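As a plausibility check, the three-step structure of Theorem 6.18 can be modeled in Python and compared with a direct (unscaled) length-12 IDFT. This is a sketch under stated assumptions: the short-transform matrices below are one consistent choice for a 3-point and a 4-point Winograd block, the output addition matrix of the 4-point block is taken as rows (1 0 0 0), (0 0 1 1), (0 1 0 0), (0 0 1 -1), the permutation is the row-by-row read of $9k_1 + 4k_2 \bmod 12$, and all helper names are hypothetical.

```python
import cmath
import math


def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


A1 = [[1, 1, 1], [0, 1, 1], [0, 1, -1]]                        # 3-point input additions
A2 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 0, -1, 0], [0, 1, 0, -1]]
B1 = [1, -1.5, math.sqrt(3) / 2]                               # diagonal entries
B2 = [1, 1, 1, -1j]
C1 = [[1, 0, 0], [1, 1, 1j], [1, 1, -1j]]                      # 3-point output additions
C2 = [[1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, -1]]

# Row-by-row read of the index table: k2 is the slow index, k1 the fast one.
ORDER = [(9 * k1 + 4 * k2) % 12 for k2 in range(3) for k1 in range(4)]


def winograd_idft12(spectrum):
    """Unscaled length-12 IDFT: preadditions, 12 scalings, postadditions."""
    permuted = [spectrum[k] for k in ORDER]
    t = matvec(kron(A1, A2), permuted)                 # step 1: preadditions
    diag = [d1 * d2 for d1 in B1 for d2 in B2]         # diagonal of B1 (x) B2
    t = [d * v for d, v in zip(diag, t)]               # step 2: multiplications
    t = matvec(kron(C1, C2), t)                        # step 3: postadditions
    x = [0j] * 12
    for pos, n in enumerate(ORDER):                    # undo the permutation
        x[n] = t[pos]
    return x


def idft_direct(spectrum):
    """Direct unscaled IDFT, x[n] = sum_k X[k] exp(+j*2*pi*n*k/N)."""
    n_len = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * n * k / n_len)
                for k in range(n_len)) for n in range(n_len)]
```

Each of the 12 diagonal entries is purely real or purely imaginary, so scaling a complex value costs two real multiplications, in line with the count 2 x 3 x 4 = 24 when trivial factors are included.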



So far we have used the Winograd FFT to compute the IDFT. If we now
want to compute the DFT with the help of an IDFT, we can use a technique
we used in (6.6) on p. 244 to compute the IDFT with help of the DFT. Using
matrix/vector notation we find

$$\mathbf{x}^{*} = (\mathbf{W}_N^{*} \mathbf{X})^{*} \tag{6.52}$$

$$\mathbf{x}^{*} = \mathbf{W}_N \mathbf{X}^{*}, \tag{6.53}$$

if $\mathbf{W}_N = [e^{-j2\pi nk/N}]$ with $n, k \in \mathbb{Z}_N$ is a DFT. The DFT can therefore be
computed using the IDFT with the following steps: Compute the conjugate
complex of the input sequence, transform the sequence with the IDFT
algorithm, and compute the conjugate complex of the output sequence.
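The three steps translate directly into code. The following minimal sketch assumes an unscaled IDFT with kernel $e^{+j2\pi nk/N}$; the function names are hypothetical and the direct $O(N^2)$ IDFT merely stands in for any IDFT algorithm, such as the Winograd FFT.

```python
import cmath


def idft_unscaled(spectrum):
    """Unscaled IDFT, x[n] = sum_k X[k] exp(+j*2*pi*n*k/N)."""
    n_len = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * n * k / n_len)
                for k in range(n_len)) for n in range(n_len)]


def dft_via_idft(x):
    """DFT computed as conj(IDFT(conj(x))), following (6.52) and (6.53)."""
    conj_in = [complex(v).conjugate() for v in x]      # conjugate the input
    y = idft_unscaled(conj_in)                         # run the IDFT algorithm
    return [v.conjugate() for v in y]                  # conjugate the output


def dft_direct(x):
    """Reference DFT with kernel exp(-j*2*pi*n*k/N)."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_len)
                for n in range(n_len)) for k in range(n_len)]
```

For real input data the first conjugation is a no-op, so only the output conjugation remains.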

It is also possible to use the Kronecker product algorithms, i.e., the
Winograd FFT, to compute the DFT directly. This leads to a slightly modified output
index mapping, as the following example shows.

Example 6.20: A 12-point DFT can be computed using the following Kronecker
product formulation:

(6.54)

The input sequence $x[n]$ can be considered to be in the order used for Good-Thomas
mapping, while in the (frequency) output index mapping for $X[k]$, each first and
third element are exchanged, compared with the Good-Thomas mapping. | 6.20 |



6.2.4 Comparison of DFT and FFT Algorithms 

It should now be apparent that there are many ways to implement a DFT. 
The choice begins with the selection of a short DFT algorithm from among 
those shown in Fig. 6.1 (p. 241). The short DFT can then be used to develop
long DFTs, using the indexing schemes provided by Cooley-Tukey, Good- 
Thomas, or Winograd. A common objective in choosing an implementation 
is minimum multiplication complexity. This is a viable criterion when the 
implementation cost of multiplication is much higher compared with other 
operations, such as additions, data access, or index computation. 

Figure 6.17 shows the number of multiplications required for various FFT
lengths. It can be concluded that the Winograd FFT is most attractive,
based purely on a multiply complexity criterion. In this chapter, the design of
$N = 4 \times 3 = 12$-point FFTs has been presented in several forms. A comparison
of direct, Rader prime factor, and Winograd DFT algorithms
used for the basic DFT blocks, and the three different index mappings called
Good-Thomas, Cooley-Tukey, and Winograd FFT, is presented in Table 6.4.

Table 6.4. Number of real multiplications for a length-12 complex input FFT
algorithm (twiddle factor multiplications by $W^0$ are not counted). A complex
multiplication is assumed to use four real multiplications.

                                    Index mapping
DFT       Good-Thomas               Cooley-Tukey        Winograd
method    Fig. 6.16, p. 273         Fig. 6.2, p. 260    Example 6.16, p. 274
-----------------------------------------------------------------------------
Direct    4 x 12^2 = 4 x 144 = 576 (all index mappings)
RPFA      4(3(4 - 1)^2              4(43 + 6) = 196
          + 4(3 - 1)^2) = 172
WFTA      3 x 0 x 2                 16 + 4 x 6 = 40     2 x 3 x 4 = 24
          + 4 x 2 x 2 = 16

Besides the number of multiplications, other constraints must be consid- 
ered, such as possible transform lengths, number of additions, index compu- 
tation overhead, coefficient or data memory size, and run-time code length. 




Fig. 6.17. Comparison of different FFT algorithms based on the number of
necessary real multiplications. (Plot legend: 1 butterfly, 2 butterflies,
3 butterflies, Good-Thomas FFT, Winograd FFT.)
Table 6.5. Important properties for FFT algorithms of length $N = \prod N_k$.

Property                  Cooley-Tukey       Good-Thomas            Winograd
------------------------------------------------------------------------------------
Any transform length      yes                no; requires           no; requires
                                             gcd(N_k, N_l) = 1      gcd(N_k, N_l) = 1
Maximum order of W        N                  max(N_k)               max(N_k)
Twiddle factors needed    yes                no                     no
# Multiplications         bad                fair                   best
# Additions               fair               fair                   fair
# Index computation       best               fair                   bad
  effort
Data in-place             yes                yes                    no
Implementation            small butterfly    can use RPFA,          small size for
advantages                processor          fast, simple           full parallel,
                                             FIR array              medium-size
                                                                    FFT (< 50)



In many cases, the Cooley-Tukey method provides the best overall solution, 
as suggested by Table 6.5. 

With FPGAs reaching complexities of more than 1M gates today, full in- 
tegration of an FFT on a single FPGA is viable. Because the design of such 
an FFT block is labor intensive, it most often makes sense to utilize com- 
mercially available “intellectual property” (IP) blocks (sometimes also called 
“virtual components” VCs). See, for instance, the IP partnership programs 
at www.xilinx.com or www.altera.com. The majority of the commercially 
available designs are based on radix-2 or radix-4. 

Some of the published FPGA realizations are summarized in Table 6.6.
The design by Goslin [120] is based on a radix-2 FFT, in which the butterflies
have been realized using distributed arithmetic, discussed in Chap. 2. The
design by Dandalis et al. [136] is based on an approximation of the DFT
using the so-called arithmetic Fourier transform and will be discussed in
Sect. 7.1. The ERNS FFT, from Meyer-Base et al. [137], uses the Rader
algorithm in combination with the number theoretic transform, which will
also be discussed in Chap. 7.

Table 6.6. Comparison of some FPGA FFT implementations [5].

Name          Data       FFT type      N-point      Clock rate,       Internal   Design aim/
              type                     FFT time     power P           RAM/ROM    source
----------------------------------------------------------------------------------------------
Xilinx FPGA   8 bit      Radix-2 FFT   N = 256      70 MHz            no         573 CLBs
                                       102.4 µs     4.8 W @ 3.3 V                [120]
Xilinx FPGA   16 bit     AFT           N = 256      50 MHz            no         [136]
                                       82.48 µs     15.6 W @ 3.3 V               2602 CLBs
                                       42.08 µs     29.5 W                       4922 CLBs
Xilinx FPGA   12.7 bit   FFT using     N = 97       26 MHz            no         1178 CLBs
                         NTT           9.24 µs      3.5 W @ 3.3 V                ERNS-NTT [137]


6.3 Fourier-related Transforms

The discrete cosine transform (DCT) and discrete sine transform (DST) are
not DFTs, but they can be computed using a DFT. However, DCTs and DSTs
cannot be used directly to compute fast convolution, by multiplying the
transformed spectra and an inverse transform, i.e., the convolution theorem
does not hold. The applications for DCTs and DSTs are therefore not as broad
as those for FFTs, but in some applications, like image compression, DCTs
are (due to their close relationship to the Karhunen-Loève transform) very
popular. However, because DCTs and DSTs are defined by sine and cosine
“kernels,” they have a close relation to the DFT, and will be presented in this
chapter. We will begin with the definition and properties of DCTs and DSTs,
and will then present an FFT-like fast computation algorithm to implement
the DCT. All DCTs obey the following transform pattern observed by Wang
[138]:



$$X[k] = \sum_{n} x[n]\, C_N^{n,k} \quad\longleftrightarrow\quad x[n] = \sum_{k} X[k]\, C_N^{n,k}. \tag{6.55}$$

The kernel functions $C_N^{n,k}$, for four different DCT instances, are defined by

DCT-I:   $C_N^{n,k} = \sqrt{2/N}\, c[n]\, c[k] \cos\left(nk\frac{\pi}{N}\right)$,   n, k = 0, 1, . . . , N

DCT-II:  $C_N^{n,k} = \sqrt{2/N}\, c[k] \cos\left(k(n + \tfrac{1}{2})\frac{\pi}{N}\right)$,   n, k = 0, 1, . . . , N - 1

DCT-III: $C_N^{n,k} = \sqrt{2/N}\, c[n] \cos\left(n(k + \tfrac{1}{2})\frac{\pi}{N}\right)$,   n, k = 0, 1, . . . , N - 1

DCT-IV:  $C_N^{n,k} = \sqrt{2/N}\, \cos\left((k + \tfrac{1}{2})(n + \tfrac{1}{2})\frac{\pi}{N}\right)$,   n, k = 0, 1, . . . , N - 1,

where $c[m] = 1$ except $c[0] = 1/\sqrt{2}$. The DST has the same structure, but the
cosine terms are replaced by sine terms. DCTs have the following properties:

1) DCTs implement functions using cosine bases.

2) All transforms are orthogonal, i.e., $\mathbf{C} \times \mathbf{C}^{t} = k[n]\mathbf{I}$.

3) A DCT is a real transform, unlike the DFT.

4) DCT-I is its own inverse.

5) DCT-II is the inverse of DCT-III, and vice versa.

6) DCT-IV is its own inverse. Type IV is symmetric, i.e., $\mathbf{C} = \mathbf{C}^{t}$.

7) The convolution property of the DCT is not the same as the convolution
multiplication relationship in the DFT.

8) The DCT is an approximation of the Karhunen-Loève transformation
(KLT).
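Properties 5) and 6) can be checked numerically from the kernel definitions. The sketch below builds the DCT-II, DCT-III, and DCT-IV matrices with row index k and column index n (so that X = C x); this storage convention and the helper names are assumptions made for the illustration.

```python
import math


def c(m):
    """Scaling c[m]: 1/sqrt(2) for m = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if m == 0 else 1.0


def dct_matrix(kind, n_len):
    """DCT matrix of type 2, 3, or 4; entry [k][n] multiplies x[n] to form X[k]."""
    s = math.sqrt(2.0 / n_len)
    p = math.pi / n_len
    if kind == 2:
        return [[s * c(k) * math.cos(k * (n + 0.5) * p) for n in range(n_len)]
                for k in range(n_len)]
    if kind == 3:
        return [[s * c(n) * math.cos(n * (k + 0.5) * p) for n in range(n_len)]
                for k in range(n_len)]
    if kind == 4:
        return [[s * math.cos((k + 0.5) * (n + 0.5) * p) for n in range(n_len)]
                for k in range(n_len)]
    raise ValueError("only DCT types 2, 3, and 4 are modeled here")


def matmul(a, b):
    """Ordinary matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def is_identity(m, tol=1e-12):
    """True if m equals the identity matrix within the tolerance."""
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(len(m)) for j in range(len(m)))
```

With these definitions the DCT-III matrix is the transpose of the DCT-II matrix, so `matmul(dct_matrix(3, 8), dct_matrix(2, 8))` gives the identity, and the DCT-IV matrix multiplied by itself does the same.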

The two-dimensional 8x8 transform of the DCT-II is used most often in 
image compression, i.e., in the H.261, H.263, and MPEG standards for video 
and in the JPEG standard for still images. Because the two-dimensional 
transform is separable into two dimensions, we compute the two-dimensional 
DCT by row transforms followed by column transforms, or vice versa (Ex- 
ercise 6.17, p. 286). We will therefore focus on the implementation of one- 
dimensional transforms. 



6.3.1 Computing the DCT Using the DFT 



Narasimha and Peterson [139] have introduced a scheme describing how to
compute the DCT with the help of the DFT [140, p. 50]. The mapping of
the DCT to the DFT is attractive because we can then use the wide variety
of FFT-type algorithms. Because DCT-II is used most often, we will further
develop the relationship of the DFT and DCT-II. To simplify the
representation, we will skip the scaling operation, since it can be included at the end of
the DFT or FFT computation. Assuming that the transform length is even,
we can rewrite the DCT-II transform

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right)$$

using the following permutation

$$y[n] = x[2n] \quad\text{and}\quad y[N - n - 1] = x[2n + 1] \quad\text{for } n = 0, 1, \ldots, N/2 - 1.$$

It follows then that

$$X[k] = \sum_{n=0}^{N/2-1} y[n] \cos\left(k\left(2n + \frac{1}{2}\right)\frac{\pi}{N}\right) + \sum_{n=0}^{N/2-1} y[N-n-1] \cos\left(k\left(2n + \frac{3}{2}\right)\frac{\pi}{N}\right) \tag{6.56}$$

$$X[k] = \sum_{n=0}^{N-1} y[n] \cos\left(k\left(2n + \frac{1}{2}\right)\frac{\pi}{N}\right). \tag{6.57}$$

The second sum in (6.56) turns into the upper half of (6.57) after substituting
$m = N - n - 1$ and using the $2\pi$ periodicity of the cosine. If we now compute
the DFT of $y[n]$, denoted with $Y[k]$, we find with $W_{4N} = e^{-j2\pi/(4N)}$ that

$$X[k] = \Re\left(W_{4N}^{k}\, Y[k]\right) = \cos\left(\frac{\pi k}{2N}\right) \Re(Y[k]) + \sin\left(\frac{\pi k}{2N}\right) \Im(Y[k]). \tag{6.58}$$

This can be easily translated into a C or MatLab program (see Exercise
6.17, p. 286), and can be used to compute the DCT with the help of a DFT
or FFT.
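The permutation and the weighting by $W_{4N}^k$ can be sketched as follows. The sketch is unscaled, as in the derivation; a direct $O(N^2)$ DFT stands in for the FFT, and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct DFT with W_N = exp(-j*2*pi/N); any FFT could be used instead."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_len)
                for n in range(n_len)) for k in range(n_len)]


def dct2_direct(x):
    """Unscaled DCT-II, X[k] = sum_n x[n] cos(pi/N * (n + 1/2) * k)."""
    n_len = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_len)
                for n in range(n_len)) for k in range(n_len)]


def dct2_via_dft(x):
    """Unscaled DCT-II of even length N using one length-N DFT."""
    n_len = len(x)
    assert n_len % 2 == 0, "the derivation assumes an even transform length"
    y = [0.0] * n_len
    for n in range(n_len // 2):
        y[n] = x[2 * n]                      # even samples in natural order
        y[n_len - n - 1] = x[2 * n + 1]      # odd samples in reversed order
    spectrum = dft(y)
    # X[k] = Re( W_4N^k * Y[k] ) with W_4N = exp(-j*2*pi/(4N))
    return [(cmath.exp(-1j * math.pi * k / (2 * n_len)) * spectrum[k]).real
            for k in range(n_len)]
```

The orthonormal DCT-II follows by multiplying each output with $\sqrt{2/N}\,c[k]$ afterwards.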



6.3.2 Fast Direct DCT Implementation 

The symmetry properties of DCTs have been used by Byeong Lee [141] to
construct an FFT-like DCT algorithm. Because of its similarities to a radix-2
Cooley-Tukey FFT, the resulting algorithm is sometimes referred to as
the fast DCT or simply FCT. Alternatively, a fast DCT algorithm can be
developed using a matrix structure [142]. A DCT can be obtained by
“transposing” an inverse DCT (IDCT) since the DCT is known to be an
orthogonal transform. IDCT Type II was introduced in (6.55) and, noting that
$\hat{X}[k] = c[k]X[k]$, it follows that

$$x[n] = \sum_{k=0}^{N-1} \hat{X}[k]\, C_N^{n,k}, \qquad n = 0, 1, \ldots, N-1. \tag{6.59}$$

Decomposing $x[n]$ into even and odd parts it can be shown that $x[n]$ can be
reconstructed by two N/2 DCTs, namely

$$G[k] = \hat{X}[2k], \tag{6.60}$$

$$H[k] = \hat{X}[2k+1] + \hat{X}[2k-1], \qquad k = 0, 1, \ldots, N/2 - 1. \tag{6.61}$$

In the time domain, we get

$$g[n] = \sum_{k=0}^{N/2-1} G[k]\, C_{N/2}^{n,k}, \tag{6.62}$$

$$h[n] = \sum_{k=0}^{N/2-1} H[k]\, C_{N/2}^{n,k}, \qquad n = 0, 1, \ldots, N/2 - 1. \tag{6.63}$$

The reconstruction becomes

$$x[n] = g[n] + \frac{1}{2\cos\left((2n+1)\pi/(2N)\right)}\, h[n], \tag{6.64}$$

$$x[N-1-n] = g[n] - \frac{1}{2\cos\left((2n+1)\pi/(2N)\right)}\, h[n], \qquad n = 0, 1, \ldots, N/2 - 1. \tag{6.65}$$

By repeating this process, we can decompose the DCT further. Comparing
(6.62) with the radix-2 FFT twiddle factor shown in Fig. 6.12 (p. 263)
shows that a division seems to be necessary for the FCT. The twiddle factors
$1/(2\cos((2n+1)\pi/(2N)))$ should therefore be precomputed and stored in a table. Such a
table approach is also appropriate for the Cooley-Tukey FFT, because the
“online” computation of the trigonometric function is, in general, too time
consuming. We will demonstrate the FCT with the following example.

Fig. 6.18. 8-point fast DCT flow graph with the short-hand notation c[p] =
1/(2 cos(pπ/16)).



Example 6.21: An 8-point FCT

For an 8-point FCT (6.60)-(6.65) become

$$G[k] = \hat{X}[2k], \tag{6.66}$$

$$H[k] = \hat{X}[2k+1] + \hat{X}[2k-1], \qquad k = 0, 1, 2, 3, \tag{6.67}$$

and in the time domain we get

$$g[n] = \sum_{k=0}^{3} G[k]\, C_4^{n,k}, \tag{6.68}$$

$$h[n] = \sum_{k=0}^{3} H[k]\, C_4^{n,k}, \qquad n = 0, 1, 2, 3. \tag{6.69}$$

The reconstruction becomes

$$x[n] = g[n] + \frac{1}{2\cos\left((2n+1)\pi/16\right)}\, h[n], \tag{6.70}$$

$$x[N-1-n] = g[n] - \frac{1}{2\cos\left((2n+1)\pi/16\right)}\, h[n], \qquad n = 0, 1, 2, 3. \tag{6.71}$$

Equations (6.66) and (6.67) form the first stage in the flow graph in Fig. 6.18,
and (6.70) and (6.71) build the last stage in the flow graph. | 6.21 |
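The decomposition can be written as a short recursion and compared with the direct IDCT sum. The sketch below works on the unscaled kernel $\cos((2n+1)k\pi/(2N))$, assumes a power-of-two length, and treats $\hat{X}[-1]$ as zero; the function names are hypothetical, and a hardware design would read the factors $1/(2\cos(\cdot))$ from a table instead of computing them.

```python
import math


def idct_direct(xh):
    """Unscaled IDCT, x[n] = sum_k Xh[k] cos((2n+1) k pi / (2N))."""
    n_len = len(xh)
    return [sum(xh[k] * math.cos((2 * n + 1) * k * math.pi / (2 * n_len))
                for k in range(n_len)) for n in range(n_len)]


def idct_fast(xh):
    """Recursive fast IDCT in the style of the even/odd split described above."""
    n_len = len(xh)
    if n_len == 1:
        return [xh[0]]
    half = n_len // 2
    g_in = [xh[2 * k] for k in range(half)]
    # H[k] = Xh[2k+1] + Xh[2k-1], with Xh[-1] taken as zero
    h_in = [xh[2 * k + 1] + (xh[2 * k - 1] if k > 0 else 0.0) for k in range(half)]
    g = idct_fast(g_in)          # N/2-point transform of the even-indexed inputs
    h = idct_fast(h_in)          # N/2-point transform of the combined odd inputs
    x = [0.0] * n_len
    for n in range(half):
        t = h[n] / (2.0 * math.cos((2 * n + 1) * math.pi / (2 * n_len)))
        x[n] = g[n] + t                  # upper butterfly output
        x[n_len - 1 - n] = g[n] - t      # lower butterfly output
    return x
```

For N = 8 the divisors are $2\cos(p\pi/16)$ with p = 1, 3, 5, 7 in the last stage, which matches the short-hand c[p] used in Fig. 6.18.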



In Fig. 6.18, the input sequence $X[k]$ is applied in bit-reversed order.
The order of the output sequence $x[n]$ is generated in the following manner:
starting with the set (0, 1) we form the new set by adding a prefix 0 and 1.
For the prefix 1, all bits of the previous pattern are inverted. For instance,
from the sequence 10 we get the two descendants 010 and $1\overline{10} = 101$. This scheme
is graphically interpreted in Fig. 6.19.

Fig. 6.19. Input and output permutation for the 8-point fast DCT. (IDCT
input permutation by bit reversal, e.g., X[41] = 101001 original becomes
X[37] = 100101 reversed. IDCT output permutation: 000, 001, 011, 010, 111,
110, 100, 101.)
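The prefix rule for the output order is easy to express in code. The following minimal sketch generates the bit patterns as strings; the function name is hypothetical.

```python
def fct_output_order(bits):
    """Output bit patterns of the fast DCT for a transform of length 2**bits."""
    patterns = ["0", "1"]
    for _ in range(bits - 1):
        # For prefix 1, all bits of the previous pattern are inverted.
        inverted = ["".join("1" if b == "0" else "0" for b in p) for p in patterns]
        patterns = ["0" + p for p in patterns] + ["1" + p for p in inverted]
    return patterns


order = fct_output_order(3)
# -> ['000', '001', '011', '010', '111', '110', '100', '101']
indices = [int(p, 2) for p in order]
# -> [0, 1, 3, 2, 7, 6, 4, 5]
```

For three bits this reproduces the output permutation listed for the 8-point transform.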



Exercises 

6.1: Compute the 3-dB bandwidth, first zero, maximum sidelobe, and decrease per 
octave, for a rectangular and triangular window using the Fourier transform. 

6.2: (a) Compute the cyclic convolution of x[n] = {3, 1, -1} and f[n] = {2, 1, 5}.

(b) Compute the DFT matrix $\mathbf{W}_3$ for N = 3.

(c) Compute the DFT of x[n] = {3, 1, -1} and f[n] = {2, 1, 5}.

(d) Now compute Y[k] = X[k]F[k], followed by the inverse DFT $y = \mathbf{W}_3^{-1} Y$ for the signals from
part (c).

(Note: use a C compiler or MatLab for parts (c) and (d).)

6.3: A single spectral component X[k] in the DFT computation

$$X[k] = x[0] + x[1]W_N^{k} + x[2]W_N^{2k} + \ldots + x[N-1]W_N^{(N-1)k}$$

can be rearranged by collecting all common factors such that we get

$$X[k] = x[0] + W_N^{k}\left(x[1] + W_N^{k}\left(x[2] + \ldots + W_N^{k}\, x[N-1]\right) \ldots \right).$$

This results in a possibly recursive computation of X[k]. This is called the Goertzel
algorithm and is graphically interpreted by Fig. 6.5 (p. 248). The Goertzel
algorithm can be attractive if only a few spectral components must be computed. For
the whole DFT, the effort is of order $N^2$ and there is no advantage compared with
the direct DFT computation.

(a) Construct the recursive signal flow graph, including input and output register,
to compute a single X[k] for N = 5.

For N = 5 and k = 1, compute all register contents for the following input
sequences:

(b) {20, 40, 60, 80, 100}.

(c) {j20, j40, j60, j80, j100}.

(d) {20 + j20, 40 + j40, 60 + j60, 80 + j80, 100 + j100}.

6.4: The Bluestein chirp-z algorithm was defined in Sect. 6.1.4 (p. 248). This
algorithm is graphically interpreted in Fig. 6.6 (p. 249).

(a) Determine the CZT algorithms for N = 4.

(b) Using C or MatLab, determine the CZT for the triangular sequence x[n] =
{0, 1, 2, 3}.

(c) Using C or MatLab, extend the length to N = 256, and check the CZT results
with an FFT of the same length. Use a triangular input sequence, x[n] = n.

6.5: (a) Design a direct implementation of the nonrecursive filter for the N = 7 
Rader algorithm. 

(b) Determine the coefficients that can be combined. 

(c) Compare the realizations from (a) and (b) in terms of realization effort. 

6.6: Design a length N = 3 Winograd DFT algorithm. 

6.7: (a) Using the two-dimensional index transform $n = 3n_1 + 2n_2 \bmod 6$, with
$N_1 = 2$ and $N_2 = 3$, determine the mapping (6.18) on p. 259. Is this mapping
bijective?

(b) Using the two-dimensional index transform $n = 2n_1 + 2n_2 \bmod 6$, with $N_1 = 2$
and $N_2 = 3$, determine the mapping (6.18) on p. 259. Is this mapping bijective?

(c) For $\gcd(N_1, N_2) > 1$, Burrus [111] found the following conditions such that the
mapping is bijective:

$A = aN_2$ and $B \neq bN_1$ and $\gcd(a, N_1) = \gcd(B, N_2) = 1$

or

$A \neq aN_2$ and $B = bN_1$ and $\gcd(A, N_1) = \gcd(b, N_2) = 1$,

with $a, b \in \mathbb{Z}$. Suppose $N_1 = 9$ and $N_2 = 15$. For A = 15, compute all possible
values for $B \in \mathbb{Z}_{20}$.

6.8: For $\gcd(N_1, N_2) = 1$, Burrus [111] found that in the following conditions the
mapping is bijective:

$$A = aN_2 \text{ and/or } B = bN_1 \text{ and } \gcd(A, N_1) = \gcd(B, N_2) = 1, \tag{6.72}$$

with $a, b \in \mathbb{Z}$. Assume $N_1 = 5$ and $N_2 = 8$. Determine whether the following
mappings are possibly bijective index mappings:

(a) A = 8, B = 5.

(b) A = 8, B = 10.

(c) A = 24, B = 15.

(d) For A = 7, compute all valid $B \in \mathbb{Z}_{20}$.

(e) For A = 8, compute all valid $B \in \mathbb{Z}_{20}$.



6.9: (a) Draw the signal flow graph for a radix-2 DIF algorithm where N = 16. 

(b) Write a C or MatLab program for the DIF radix-2 FFT. 

(c) Test your FFT program with a triangular input x[n] = n + jn with $n \in [0, N-1]$.

6.10: (a) Draw the signal flow graph for a radix-2 DIT algorithm where N = 8.

(b) Write a C or MatLab program for the DIT radix-2 FFT.

(c) Test your FFT program with a triangular input x[n] = n + jn with $n \in [0, N-1]$.

6.11: Compute the index mapping for an N = 16 radix-4 FFT. 

Draw the signal flow graph for the N = 16 radix-4 FFT. 

6.12: Draw the signal flow graph for an N = 12 Good-Thomas FFT, such that no 
crossings occur in the signal flow graph. 

(Hint: Use a 3D representation of the row and column DFTs) 
6.13: The index transform for FFTs by Burrus and Eschenbacher [143] is given by

$$n = N_2 n_1 + N_1 n_2 \bmod N \tag{6.73}$$

and

$$k = N_2 k_1 + N_1 k_2 \bmod N \qquad \begin{cases} 0 \le k_1 \le N_1 - 1 \\ 0 \le k_2 \le N_2 - 1. \end{cases} \tag{6.74}$$

(a) Compute the mapping for n and k with $N_1 = 3$ and $N_2 = 4$.

(b) Compute $W^{nk}$.

(c) Substitute $W^{nk}$ from (b) in the DFT matrix.

(d) What type of FFT algorithm is this?

(e) Can the Rader algorithm be used to compute the DFTs of length $N_1$ or $N_2$?

6.14: (a) Compute the DFT matrices $\mathbf{W}_2$ and $\mathbf{W}_3$.

(b) Compute the Kronecker product $\mathbf{W}'_6 = \mathbf{W}_2 \otimes \mathbf{W}_3$.

(c) Compute the index for the vectors X and x, such that $\mathbf{X} = \mathbf{W}'_6 \mathbf{x}$ is a DFT of
length 6.

(d) Compute the index mapping for x[n] and X[k], with $\mathbf{x} = \mathbf{W}_2^{*} \otimes \mathbf{W}_3^{*} \mathbf{X}$ being
the IDFT.

6.15: The discrete Hartley transformation (DHT) is a transform for real signals. A
length N transform is defined by

$$H[n] = \sum_{k=0}^{N-1} \mathrm{cas}(2\pi nk/N)\, h[k], \tag{6.75}$$

with cas(x) = sin(x) + cos(x). The relation with the DFT ($f[k] \leftrightarrow F[n]$) is

$$H[n] = \Re\{F[n]\} - \Im\{F[n]\} \tag{6.76}$$

$$F[n] = E[n] - jO[n] \tag{6.77}$$

$$E[n] = \frac{1}{2}\left(H[n] + H[-n]\right) \tag{6.78}$$

$$O[n] = \frac{1}{2}\left(H[n] - H[-n]\right), \tag{6.79}$$

where $\Re$ is the real part, $\Im$ the imaginary part, E[n] the even part of H[n], and
O[n] the odd part of H[n].







(a) Compute the equation for the inverse DHT. 

(b) Compute (using the frequency convolution of the DFT) the steps to compute 
a convolution with the DHT. 

(c) Show possible simplifications for the algorithms from (b), if the input sequence 
is even. 



6.16: The DCT-II form is:

$$X[k] = c[k] \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi}{2N}(2n+1)k\right) \tag{6.80}$$

$$c[k] = \begin{cases} \sqrt{1/2} & k = 0 \\ 1 & \text{otherwise} \end{cases} \tag{6.81}$$

(a) Compute the equations for the inverse transform.

(b) Compute the DCT matrix for N = 4.

(c) Compute the transform of x[n] = {1, 2, 2, 1} and x[n] = {1, 1, -1, -1}.

(d) What can you say about the DCT of even or odd symmetric sequences?



6.17: The following MatLab code can be used to compute the DCT-II transform
(assuming even length $N = 2^n$), with the help of a radix-2 FFT (see Exercise 6.9).

    function X = DCTII(x)
    N = length(x);                              % get length
    y = [ x(1:2:N); x(N:-2:2) ];                % re-order elements (x is a column vector)
    Y = fft(y);                                 % compute the FFT
    w = 2*exp(-i*(0:N-1)'*pi/(2*N))/sqrt(2*N);  % get weights
    w(1) = w(1) / sqrt(2);                      % make it unitary
    X = real(w .* Y);                           % compute pointwise product

(a) Compile the program with C or MatLab.

(b) Compute the transform of x[n] = {1, 2, 2, 1} and x[n] = {1, 1, -1, -1}.



6.18: Like the DFT, the DCT is a separable transform, and we can therefore
implement a 2D DCT using 1D DCTs. The 2D N x N transform is given by

$$X[n_1, n_2] = \frac{c[n_1]\, c[n_2]}{4} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} x[k,l] \cos\left(n_1\left(k + \frac{1}{2}\right)\frac{\pi}{N}\right) \cos\left(n_2\left(l + \frac{1}{2}\right)\frac{\pi}{N}\right), \tag{6.82}$$

where $c[0] = 1/\sqrt{2}$ and $c[m] = 1$ for $m \neq 0$.

Use the program introduced in Exercise 6.17 to compute an 8 x 8 DCT transform
by

(a) First row followed by column transforms.

(b) First column followed by row transforms.

(c) Direct implementation of (6.82).

(d) Compare the results from (a) and (b) for the test data x[k, l] = k + l with
$k, l \in [0, 7]$.



Exercises Using MaxPlusII 

6.19: (a) Implement a first-order system according to Exercise 6.3, to compute the 
Goertzel algorithm for N = 5 and n = 1, and 8- bit coefficient and input data, using 
MaxPlusII. 

(b) Determine the number of LCs and the Registered Performance. 

Simulate the design with the three input sequences: 

(c) {20,40,60,80,100}, 

(d) {j20, j40, j60, j80, j100}, and

(e) {20 + j20, 40 + j40, 60 + j60, 80 + j80, 100 + j100}.

6.20: (a) Design a Component to compute the (real input) 4-point Winograd DFT 
(from Example 6.16, p. 274) using MaxPlusII. The input and output precision 
should be 8 bits and 10 bit, respectively. 

(b) Determine the number of LCs and the Registered Performance. 

Simulate the design with the three input sequences: 

(c) {40, 70, 100, 10}. 

(d) {0,30,60,90}. 

(e) {80, 110,20,50}. 
6.21: (a) Design a Component to compute the (complex input) 3-point Winograd
DFT (from Example 6.16, p. 274) using MaxPlusII. The input and output precision
should be 10 bits and 12 bits, respectively.

(b) Determine the number of LCs and the Registered Performance.

(c) Simulate the design with the input sequence {180, 220, 260}.

6.22: (a) Using the designed 3- and 4-point Components from Exercises 6.20 and 
6.21, use component instantiation to design a fully parallel 12-point Good-Thomas 
FFT similar to that shown in Fig. 6.16 (p. 273), using MaxPlusII. The input and 
output precision should be 8 bit and 12 bit, respectively. 

(b) Determine the number of LCs and the Registered Performance. 

(c) Simulate the design with the input sequence x[n] = 10n with $0 \le n < 12$.

6.23: (a) Design a component ccmulp similar to the one shown in Example 6.11 
(p. 265), to compute the twiddle factor multiplication. Use three pipeline stages 
for the multiplier and one for the input subtraction X — Y, using MaxPlusII. The 
input and output precision should again be 8 bits. 

(b) Conduct a simulation to ensure that the pipelined multiplier correctly computes 
(70+j50)(121+j39). 

(c) Determine the number of LCs and the Registered Performance of the twiddle 
factor multiplier. 

(d) Now implement the whole pipelined butterfly processor. 

(e) Conduct a simulation, with the data from Example 6.12 (p. 268). 

(f) Determine the number of LCs and the Registered Performance of the whole 
pipelined butterfly processor. 
Several algorithms exist that enable FPGAs to outperform PDSPs by an
order of magnitude, due to the fact that FPGAs can be built with bitwise 
implementations. Such applications are the focus of this chapter. 

For number theoretic transforms (NTTs), the essential advantage of FP- 
GAs is that it is possible to implement modulo arithmetic in any desired bit 
width. NTTs are discussed in detail in Sect. 7.1. 

For error control and cryptography, two basic building blocks are used: 
Galois field arithmetic and linear feedback shift registers (LFSR). Both can 
be efficiently implemented with FPGAs, and are discussed in Sect. 7.2. If,
for instance, an N-bit LFSR is used as an M-multistep number generator,
this will give an FPGA at least an MN speed advantage over a PDSP or
microprocessor.

Finally, in Sect. 7.3, communication systems designed with FPGAs will 
demonstrate low system costs, high throughput, and the possibility of fast 
prototyping. A comprehensive discussion of both coherent and incoherent 
receivers will close this chapter. 



7.1 Rectangular and Number Theoretic Transforms 
(NTTs) 

Fast implementation of convolution, and discrete Fourier transform (DFT) 
computations, are frequent problems in signal and image processing. In prac- 
tice these operations are most often implemented using fast Fourier transform 
(FFT) algorithms. NTTs can, in some instances, outperform FFT-based sys- 
tems. In addition, it is also possible to use a rectangular transform, like the 
Walsh-Had am ard or the arithmetic Fourier transform, to get an approxima- 
tion of the DFT or convolution, as will be discussed at the end of Sect. 7.1. 

In 1971, Pollard [144] defined the NTT, over a finite group, as the 
transform pair 

$$x[n] = N^{-1} \sum_{k=0}^{N-1} X[k]\,\alpha^{-nk} \bmod M \quad\longleftrightarrow\quad X[k] = \sum_{n=0}^{N-1} x[n]\,\alpha^{kn} \bmod M, \tag{7.1}$$

where $N \times N^{-1} \equiv 1$ exists, and $\alpha \in \mathbb{Z}_M$ ($\mathbb{Z}_M = \{0, 1, 2, \ldots, M-1\}$, and 
$\mathbb{Z}_M \cong \mathbb{Z}/M\mathbb{Z}$) is an element of order $N$, i.e., $\alpha^N \equiv 1$ and $\alpha^k \not\equiv 1$ for all 
$k \in \{1, 2, 3, \ldots, N-1\}$ in the finite group $(\mathbb{Z}_M, \times)$ (see Exercise 7.1, p. 360). 

It is important to be able to ensure that, for a given tuple $(\alpha, M, N)$, such 
a transform pair exists. Clearly, $\alpha$ must be of order $N$ modulo $M$. In order 
to ensure that the inverse NTT (INTT) exists, other requirements are: 

1) The multiplicative inverse $N^{-1} \bmod M$ must exist, i.e., the equation 
$x \times N \equiv 1 \bmod M$ must have a solution $x \in \mathbb{Z}_M$. 

2) The determinant of the transform matrix $|A| = |[\alpha^{kn}]|$ must be nonzero 
so that the matrix is invertible, i.e., $A^{-1}$ exists. 

1) It can only be concluded that a multiplicative inverse exists if $\alpha$ and 
$M$ do not share a common factor, or in short notation, $\gcd(\alpha, M) = 1$. 

2) For the second condition, a well-known fact from algebra is used: The 
NTT matrix is a special case of the Vandermonde matrix (with $a[k] = \alpha^k$), 
and it follows for the determinant 

$$\det(V) = \begin{vmatrix} 1 & a[0] & a[0]^2 & \cdots & a[0]^{L-1} \\ 1 & a[1] & a[1]^2 & \cdots & a[1]^{L-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a[L-1] & a[L-1]^2 & \cdots & a[L-1]^{L-1} \end{vmatrix} = \prod_{k>l} \left(a[k] - a[l]\right). \tag{7.2}$$

For $\det(V) \neq 0$, it is required that $a[k] \neq a[l]\ \forall\, k \neq l$. Since the calculations 
are, in fact, modulo $M$, a second constraint arises. Specifically, there cannot 
be a zero multiplier in the determinant, i.e., $\gcd\left(\prod_{k>l} \left(a[k] - a[l]\right), M\right) = 1$. 

In conclusion, to check the existence of an NTT, it must be verified that: 

Theorem 7.1: Existence of an NTT over $\mathbb{Z}_M$ 

An NTT of length $N$ for $\alpha$ defined over $\mathbb{Z}_M$ exists, if: 

1) $\gcd(\alpha, M) = 1$. 

2) $\alpha$ is of order $N$, i.e., 

$$\alpha^n \bmod M \begin{cases} = 1 & n = N \\ \neq 1 & 1 \le n < N. \end{cases} \tag{7.3}$$

3) The inverse $\det(A)^{-1}$ exists, i.e., $\gcd(\alpha^l - 1, M) = 1$ for $l = 1, 2, \ldots, N-1$. 

For $\mathbb{Z}_p$, $p$ prime, all the conditions shown above are automatically satisfied. 
In $\mathbb{Z}_p$, elements up to order $p-1$ can be found. But transforms of length $p-1$ 
are, in general, of limited practical interest, since in this case “general” 
multiplications and modulo reductions are necessary, and it is more appropriate 
to use a “normal” FFT in binary or QRNS arithmetic [145] and [35, paper 
5-6]. 
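The three conditions of Theorem 7.1 can be checked mechanically. The following Python sketch is only an illustration: the helper names `order` and `ntt_exists` are hypothetical (they are not the `order.m` utility of Exercise 7.1), and the brute-force order search is only practical for small $N$.

```python
from math import gcd

def order(alpha, M):
    """Smallest n >= 1 with alpha**n = 1 (mod M); None if alpha is not a unit."""
    if gcd(alpha, M) != 1:
        return None
    n, p = 1, alpha % M
    while p != 1:
        p = (p * alpha) % M
        n += 1
    return n

def ntt_exists(alpha, M, N):
    """Check the three conditions of Theorem 7.1 for the tuple (alpha, M, N)."""
    if gcd(alpha, M) != 1:            # condition 1
        return False
    if order(alpha, M) != N:          # condition 2
        return False
    # condition 3: alpha**l - 1 must be invertible modulo M for l = 1..N-1
    return all(gcd(pow(alpha, l, M) - 1, M) == 1 for l in range(1, N))

print(ntt_exists(16, 257, 4))         # prime modulus, alpha of order 4
print(ntt_exists(2, 2**25 + 1, 50))   # composite modulus, condition 3 fails
```

For a prime modulus such as 257 only the order has to be verified; for the composite modulus $2^{25}+1$ the element 2 has order 50, but condition 3 fails because $\gcd(2^2 - 1, M) = 3$.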

There are no useful transforms in the ring $M = 2^b$. But it is possible to 
use the next neighbors, $2^b \pm 1$. If primes are used, then conditions 1 and 3 
are automatically satisfied. We therefore need to discuss what kind of primes 
$2^b \pm 1$ are known. 

Mersenne and Fermat Numbers. Primes of the form $2^b - 1$ were first 
investigated by the French mathematician Marin Mersenne (1588-1648). Using 
the geometric series 

$$\left(1 + 2^q + 2^{2q} + \ldots + 2^{q(r-1)}\right)\left(2^q - 1\right) = 2^{qr} - 1$$

it can be concluded that the exponent $b$ of a Mersenne prime must also be a 
prime. This is necessary, but not sufficient, as the example $2^{11} - 1 = 23 \times 89$ 
shows. The first Mersenne primes $2^b - 1$ have exponents 

$$b = 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, 521, 607, 1279. \tag{7.4}$$



Primes of the type $2^b + 1$ are known from one of Fermat’s old letters. 
Fermat conjectured that all numbers $2^{(2^l)} + 1$ are primes but, as for Mersenne 
primes, the condition on the exponent is necessary but not sufficient. It is 
necessary because if $b$ has an odd factor $q > 1$, i.e., $b = q2^t$, then 

$$2^{q2^t} + 1 = \left(2^{2^t} + 1\right)\left(2^{(q-1)2^t} - 2^{(q-2)2^t} + 2^{(q-3)2^t} - \cdots + 1\right)$$

is not prime, as in the case of $(2^4 + 1) \mid (2^{12} + 1)$, i.e., $17 \mid 4097$. There are five 
known Fermat primes 

$$F_0 = 3 \quad F_1 = 5 \quad F_2 = 17 \quad F_3 = 257 \quad F_4 = 65537, \tag{7.5}$$

but Euler (1707-1783) showed that 641 divides $F_5 = 2^{32} + 1$. Up to $F_{21}$ there 
are no further Fermat primes, which reduces the Fermat primes usable for 
NTTs to the first five. 
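The statements about Mersenne and Fermat numbers can be cross-checked numerically. The sketch below is an illustration only: it uses the Lucas-Lehmer test, which is not discussed in the text, to list Mersenne exponents below 130, and plain trial division for the small Fermat numbers. All function names are hypothetical.

```python
def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_mersenne_prime(b):
    """Lucas-Lehmer test for 2**b - 1 (assumption: technique not from the text)."""
    if b == 2:
        return True
    if not is_prime(b):               # a prime exponent is necessary
        return False
    M = (1 << b) - 1
    s = 4
    for _ in range(b - 2):
        s = (s * s - 2) % M
    return s == 0

exponents = [b for b in range(2, 130) if is_mersenne_prime(b)]
fermat = [2 ** (2 ** t) + 1 for t in range(6)]   # F_0 ... F_5

print(exponents)
print([is_prime(f) for f in fermat[:5]], fermat[5] % 641)
```

The exponent 11 is prime but does not appear in the resulting list, in line with $2^{11} - 1 = 23 \times 89$.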



7.1.1 Arithmetic Modulo 2 b ± 1 

In Chap. 2, the one’s complement (1C) and diminished-by-one (D1) coding 
were reviewed. Consult Table 2.1 (p. 35) for 1C and D1 coding. It was claimed 
that 1C coding can efficiently represent arithmetic modulo $2^b - 1$. This is used 
to build Mersenne NTTs, as suggested by Rader [146]. D1 coding efficiently 
represents arithmetic modulo $2^b + 1$, and is therefore preferred for Fermat 
NTTs, as suggested by Leibowitz [147]. 

The following table illustrates again the 1C and D1 arithmetic for 
computing addition. 

    1C                  D1
    s = a + b + c_N     if((a == 0) && (b == 0)) s = 0
                        else s = a + b + \bar{c}_N

where $a$ and $b$ are the input operands, $s$ is the sum, and $c_N$ the carry bit of 
the intermediate sum $a + b$ without modulo reduction. To implement the 1C 
addition, first form the intermediate $B$-bit sum. Then add the carry of the 
MSB $c_N$ to the LSB. In D1 arithmetic, the carry must first be inverted before 
adding it to the LSB. The hardware requirement to add modulo $2^B \pm 1$ is 
therefore a total of two adders. The second adder may be built using 
half-adders, because one operand, besides the carry in the LSB, is zero. 
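The end-around-carry rule can be sketched in a few lines of Python. This is a behavioral model under stated assumptions, not a hardware description: the word width `B = 4` is chosen to match $M = 15$ and $M = 17$, the D1 zero is modeled as the code `1 << B` (zero flag set), and, going slightly beyond the table above, the D1 routine returns the other operand whenever either input is zero. The function names are hypothetical.

```python
B = 4                    # assumed word width: M = 15 (1C) and M = 17 (D1)
MASK = (1 << B) - 1
ZERO_D1 = 1 << B         # assumed D1 code for zero: zero flag set, data bits clear

def add_1c(a, b):
    """One's complement addition modulo 2**B - 1 with end-around carry."""
    s = a + b
    carry = s >> B
    return (s & MASK) + carry          # the all-ones word is the second zero

def add_d1(a, b):
    """Diminished-by-one addition modulo 2**B + 1 (nonzero x is stored as x - 1)."""
    if a == ZERO_D1:                   # adding zero returns the other operand
        return b
    if b == ZERO_D1:
        return a
    s = a + b
    carry = s >> B
    return (s & MASK) + (1 - carry)    # inverted carry is added to the LSB

print(format(add_1c(0b0111, 0b1010), "04b"))   # 7 + 10 modulo 15
print(format(add_d1(0b0110, 0b1001), "05b"))   # 7 + 10 modulo 17, D1 codes 6 and 9
```

In the D1 case a result of `1 << B` is the zero code, so no separate zero detection is needed after the second adder.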

Example 7.2: As an example, compute 10 + 7 mod M. 

                  1C              D1
    Decimal       M = 15          M = 17
        7          0111            00110
     + 10         +1010           +01001
       17         10001            01111
    Correction    +1 = 0010       +1 = 1.0000
    Check:        17 mod 15 = 2   17 mod 17 = 0

| 7.2 | 



Subtraction is defined in terms of an additive inverse. Specifically, $B = -A$ 
is said to be the additive inverse of $A$ if $A + B = 0$. How the additive 
inverse is built can easily be seen by consulting Table 2.1 (p. 35). Additive 
inverse production is 

    1C            D1
    \bar{a}       if(zf(a) != 1) \bar{a}

It can be seen that a bitwise complement must first be computed. That is 
sufficient in the case of 1C coding, and for the nonzero elements in D1 coding. But 
for the zero in D1, the bitwise complement should be inhibited. 

Example 7.3: The computation of the additive inverse of two is as follows 

                  1C          D1
    Decimal       M = 15      M = 17
        2         0010        0001
       -2         1101        1110

which can be verified using the data provided in Table 2.1 (p. 35). 

| 7.3 | 



The simplest $\alpha$ for an NTT is 2. Depending on $M = 2^b \pm 1$, the arithmetic 
coding (1C for Mersenne transforms and D1 for Fermat NTTs) is selected 
first. The only necessary multiplications are then those with $\alpha^k = 2^k$. These 
multiplications are implemented, as shown in Chap. 2, by a binary (left) 
rotation by $k$ bit positions. The leftmost outgoing bit, i.e., carry $c_N$, is copied 
to the LSB. For the D1 coding (other than where $X = 0$) a complement of 
the carry bit must be computed, as the following table shows: 

    1C                  D1
    shl(X, k, c_N)      if(X != 0) shl(X, k, \bar{c}_N)


The following example illustrates the multiplications by $\alpha^k = 2^k$ used 
most frequently in NTTs. 

Example 7.4: Multiplication by $2^k$ for 1C and D1 Coding 

The following table shows the multiplication of ±2 by 2, and finally a 
multiplication of 2 by $8 = 2^3$ to demonstrate the modulo operation for 1C and D1 
coding. 

                  1C          D1
    Decimal       M = 15      M = 17
     2 x 2^1      0010        0001
       = 4        0100        0011
    -2 x 2^1      1101        1110
       = -4       1011        1100
     2 x 2^3      0010        0001
       = 16       0001        1111

which can be verified using the data found in Table 2.1 (p. 35). | 7.4 | 
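The rotation rule for multiplication by $2^k$ can be modeled bit by bit. The following Python sketch assumes a word width of `B = 4` and the D1 zero code `1 << B`, as in Example 7.4; the function names are hypothetical, and a hardware realization would use a barrel shifter instead of the loop shown here.

```python
B = 4                    # assumed word width: M = 15 (1C) and M = 17 (D1)
MASK = (1 << B) - 1
ZERO_D1 = 1 << B         # assumed D1 code for zero

def mul2k_1c(x, k):
    """Multiply by 2**k modulo 2**B - 1: rotate left, carry copied to the LSB."""
    for _ in range(k):
        carry = x >> (B - 1)
        x = ((x << 1) & MASK) | carry
    return x

def mul2k_d1(x, k):
    """Multiply by 2**k modulo 2**B + 1 in D1 code: inverted carry goes to the LSB."""
    if x == ZERO_D1:                   # zero is left unchanged
        return x
    for _ in range(k):
        carry = x >> (B - 1)
        x = ((x << 1) & MASK) | (1 - carry)
    return x

for x1c, xd1, k in [(0b0010, 0b0001, 1), (0b1101, 0b1110, 1), (0b0010, 0b0001, 3)]:
    print(format(mul2k_1c(x1c, k), "04b"), format(mul2k_d1(xd1, k), "04b"))
```

The three printed rows correspond to the three multiplications of Example 7.4.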



7.1.2 Efficient Convolutions Using NTTs 

In the last section we saw that with $\alpha$ being a power of two, multiplication 
was reduced to data shifts that can be built efficiently and fast with FPGAs, 
if the modulus is $M = 2^b \pm 1$. Obviously this can be extended to complex 
$\alpha$s of the kind $2^u \pm j2^v$. Multiplication of complex $\alpha$s can also be reduced to 
simple data shifts. 

In order to avoid general multiplications and general modulo operations, 
the following constraints when building NTTs should be taken into account: 

Theorem 7.5: Constraints for Practical Useful NTTs 

An NTT is only of practical interest if 

1) The arithmetic is modulo $M = 2^b \pm 1$. 

2) All multiplications $x[k]\alpha^{kn}$ can be realized with a maximum of 2 
modulo additions. 
7.1.3 Fast Convolution Using NTTs 

Fast cyclic convolution of two sequences x and h may be performed by multi- 
plying two transformed sequences [56, 133, 146], as described by the following 
theorem. 

Theorem 7.6: Convolution by NTT 

Let $x$ and $y$ be sequences of length $N$ defined modulo $M$, and $z = 
\langle x \circledast y \rangle_M$ be the circular convolution of $x$ and $y$. Let $X = \mathrm{NTT}(x)$, and 
$Y = \mathrm{NTT}(y)$ be the length-$N$ NTTs of $x$ and $y$ computed over $M$. Then 
$$z = \mathrm{NTT}^{-1}(X \odot Y). \tag{7.6}$$

To prove the theorem, it must first be known that the commutative, associative, 
and distributive laws hold in a ring modulo $M$. That these properties 
hold is obvious, since $\mathbb{Z}$ is an integral domain (a commutative ring with unity) 
[148, 149]. 

Specifically, the circular convolution outcome, $y[n]$, of two sequences $x$ and $h$ is given by 

$$y[n] = \left\langle N^{-1} \sum_{l=0}^{N-1} \left( \sum_{m=0}^{N-1} x[m]\,\alpha^{ml} \right) \left( \sum_{k=0}^{N-1} h[k]\,\alpha^{kl} \right) \alpha^{-ln} \right\rangle_M. \tag{7.7}$$

Applying the properties of commutation, association, and distribution, the 
sums and products can be rearranged, giving 

$$y[n] = \left\langle N^{-1} \sum_{m=0}^{N-1} \sum_{k=0}^{N-1} x[m]\,h[k] \left( \sum_{l=0}^{N-1} \alpha^{(m+k-n)l} \right) \right\rangle_M. \tag{7.8}$$

Clearly for combinations of $m$, $n$, and $k$ such that $(m + k - n) \equiv 0 \bmod N$, 
the sum over $l$ gives $N$ ones and is therefore equal to $N$. However, for 
$\langle m + k - n \rangle_N = r \neq 0$, the sum is given by 

$$\sum_{l=0}^{N-1} \alpha^{rl} = 1 + \alpha^{r} + \alpha^{2r} + \ldots + \alpha^{r(N-1)} = \frac{1 - \alpha^{rN}}{1 - \alpha^{r}} = 0 \tag{7.9}$$

for $\alpha^r \neq 1$. Because $\alpha$ is of order $N$, and $r < N$, it follows that $\alpha^r \neq 1$. It 
follows that for the sum over $l$, (7.8) becomes 

$$N^{-1} \sum_{l=0}^{N-1} \alpha^{(m+k-n)l} = \begin{cases} \langle N N^{-1} \rangle_M = 1 & \text{for } m + k - n \equiv 0 \bmod N \\ 0 & \text{for } m + k - n \not\equiv 0 \bmod N. \end{cases}$$

It is now possible to eliminate either the sum over $k$, using $k = \langle n - m \rangle_N$, or 
the sum over $m$, using $m = \langle n - k \rangle_N$. The first case gives 

$$y[n] = \left\langle \sum_{m=0}^{N-1} x[m]\,h[\langle n - m \rangle_N] \right\rangle_M, \tag{7.10}$$

while the second case gives 

$$y[n] = \left\langle \sum_{k=0}^{N-1} h[k]\,x[\langle n - k \rangle_N] \right\rangle_M. \tag{7.11}$$

□ 
The following example demonstrates the convolution. 

Example 7.7: Fermat NTT of Length 4 

Compute the cyclic convolution of the length-4 time series $x[n] = \{1, 1, 0, 0\}$ and 
$h[n] = \{1, 0, 0, 1\}$, using a Fermat NTT modulo 257. 

Solution: For the NTT of length 4 modulo $M = 257$, the element $\alpha = 16$ 
has order 4. In addition, using a symmetric range $[-128, \ldots, 128]$, we need 
$4^{-1} \equiv -64 \bmod 257$ and $16^{-1} \equiv -16 \bmod 257$. The transform and inverse 
transform matrices are given by 

$$T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 16 & -1 & -16 \\ 1 & -1 & 1 & -1 \\ 1 & -16 & -1 & 16 \end{bmatrix} \qquad T^{-1} = -64 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -16 & -1 & 16 \\ 1 & -1 & 1 & -1 \\ 1 & 16 & -1 & -16 \end{bmatrix}. \tag{7.12}$$

The transform of $x[n]$ and $h[n]$ is followed by the element-by-element 
multiplication of the transformed sequences. The result for $y[n]$, using the INTT, is 
shown in the following 

    n, k                    = {0,   1,  2,   3}
    x[n]                    = {1,   1,  0,   0}
    X[k]                    = {2,  17,  0, -15}
    h[n]                    = {1,   0,  0,   1}
    H[k]                    = {2, -15,  0,  17}
    X[k] x H[k]             = {4,   2,  0,   2}
    y[n] = x[n] (*) h[n]    = {2,   1,  0,   1}.
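Example 7.7 can be reproduced with a direct $O(N^2)$ evaluation of the transform pair (7.1). The following Python sketch is meant as a check of the numbers, not as an efficient implementation; the function names are hypothetical, and the three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
M, ALPHA, N = 257, 16, 4      # modulus, element of order 4, transform length

def ntt(x, root):
    """Direct evaluation of X[k] = sum_n x[n] * root**(n*k) mod M."""
    return [sum(x[n] * pow(root, n * k, M) for n in range(N)) % M
            for k in range(N)]

def intt(X):
    """Inverse NTT: same sum with alpha**-1, scaled by N**-1 mod M."""
    inv_n = pow(N, -1, M)
    inv_root = pow(ALPHA, -1, M)
    return [(inv_n * v) % M for v in ntt(X, inv_root)]

def cyclic_convolve(x, h):
    """Cyclic convolution via pointwise multiplication in the NTT domain."""
    X, H = ntt(x, ALPHA), ntt(h, ALPHA)
    return intt([(a * b) % M for a, b in zip(X, H)])

def signed(v):
    """Map a residue to the symmetric range used in the example."""
    return v - M if v > M // 2 else v

x, h = [1, 1, 0, 0], [1, 0, 0, 1]
print([signed(v) for v in ntt(x, ALPHA)])
print([signed(v) for v in ntt(h, ALPHA)])
print(cyclic_convolve(x, h))
```

The result is only the true convolution as long as no output sample exceeds the modulus, which is the wordlength limitation discussed next.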



Wordlength limitations for NTT. When using an NTT to perform 
convolution, remember that all elements of the output sequence $y[n]$ must be 
bounded by $M$. This is true (for simplicity, unsigned coding is assumed) if 

$$x_{\max}\, h_{\max}\, L \le M. \tag{7.13}$$

If the bit widths $B_x = \log_2(x_{\max})$, $B_h = \log_2(h_{\max})$, $B_L = \log_2(L)$, and 
$B_M = \log_2(M)$ are used, it follows that for $B_x = B_h$ the maximum bit width 
of the input is bounded by 

$$B_x \le \frac{B_M - B_L}{2}, \tag{7.14}$$

with the additional constraint that $M = 2^b \pm 1$, and $\alpha$ is a power of two. It 
follows that very few prime $M$ transforms exist. Table 7.1 displays the most 
useful choices of $\alpha$s, and the attendant transform length (i.e., order of $\alpha$s) of 
Mersenne and Fermat NTTs. 

If complex transforms and nonprime $M$s are also considered, then the 
number and length of the transforms become larger, and the complexity also 
increases. In general, for nonprime moduli, the conditions from Theorem 7.1 
(p. 290) should be checked. It is still possible to utilize Mersenne or Fermat 
arithmetic, by using the following congruence 

$$a \bmod u = (a \bmod (u \times v)) \bmod u, \tag{7.15}$$
Table 7.1. Prime $M = 2^b \pm 1$ NTTs including complex transforms. 

    Mersenne M = 2^b - 1        Fermat M = 2^b + 1
    alpha     ord_M(alpha)      alpha      ord_M(alpha)
    2         b                 2          2b
    -2        2b                sqrt(2)    4b
    ±2j       4b                1+j        4b
    1±j       8b

which states that everything is first computed modulo $M = u \times v = 2^b \pm 1$, 
and only the output sequence need be computed modulo $u$, which is the 
valid modulus regarding Theorem 7.1. Although using $M = u \times v = 2^b \pm 1$ 
increases the internal bit width, 1C or D1 arithmetic can be used. They have 
lower complexity than general arithmetic modulo $u$, and this will, in general, 
reduce the overall effort for the system. 

Such nonprime NTTs are called pseudotransforms, i.e., pseudo-Mersenne 
transforms or pseudo-Fermat transforms. The following example demonstrates 
the construction of a pseudo-Fermat transform. 



Example 7.8: A Fermat NTT of Length 50 

Using the MatLab utility order.m (see Exercise 7.1, p. 360), it can be 
determined that $\alpha = 2$ is of order 50 modulo $2^{25} + 1$. From Theorem 7.1, we know 
that $\gcd(\alpha^2 - 1, M) = 3$, and a length 50 transform does not exist modulo 
$2^{25} + 1$. It is therefore necessary to identify the “bad” factors in $M = (2^b \pm 1)$, 
those that do not have order 50, and exclude these factors by using the final 
modulo operation in (7.15). 

Solution: Using the standard MatLab function factor(2^25+1), the prime 
factors of $M$ are: 

$$2^{25} + 1 = 3 \times 11 \times 251 \times 4051. \tag{7.16}$$

The order of $\alpha = 2$ for each single factor can be computed with the algorithm 
given in Exercise 7.1 on p. 360. They are 

$$\mathrm{ord}_3(2) = 2 \qquad \mathrm{ord}_{11}(2) = 10 \qquad \mathrm{ord}_{251}(2) = 50 \qquad \mathrm{ord}_{4051}(2) = 50. \tag{7.17}$$

In order to have an NTT of length 50, a final modulo reduction with 
$(2^{25} + 1)/33$ must be computed. | 7.8 | 
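The numbers of Example 7.8 can be cross-checked with a short script. The Python sketch below uses a brute-force order search (a hypothetical stand-in for the order.m utility, which is not reproduced here) and also illustrates congruence (7.15) for one arbitrarily chosen test value `a`.

```python
def order(alpha, m):
    """Smallest n >= 1 with alpha**n = 1 (mod m); alpha must be coprime to m."""
    n, p = 1, alpha % m
    while p != 1:
        p = (p * alpha) % m
        n += 1
    return n

M = 2**25 + 1
factors = [3, 11, 251, 4051]                 # prime factors from (7.16)
orders = {p: order(2, p) for p in factors}   # orders of alpha = 2, see (7.17)

u = M // 33                                  # "good" part of the modulus
a = 123456789012                             # arbitrary test value (assumption)

print(orders)
print(order(2, u))
print((a % M) % u == a % u)                  # congruence (7.15)
```

Because $u$ divides $M$, all intermediate results may be kept modulo $M$ in D1 arithmetic, and only the final output is reduced modulo $u$.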



Comparing Fermat and Mersenne NTT implementations, consider that 

• A Mersenne NTT of length $b$, with $b$ prime, can be converted by the 
chirp-z transform (CZT), or the Rader prime factor theorem (PFT) [132], 
into a cyclic convolution, as shown in Fig. 7.1a. In addition this allows a 
simplified bus structure if a multi-FPGA implementation [137] is used. 

• Fermat NTTs with $M = 2^{2^t} + 1$ have a power-of-two length $N = 2^t$, and 
can therefore be implemented with the usual Cooley-Tukey radix-2-type 
FFT algorithm, which we discussed in Chap. 6. 
7.1.4 Multidimensional Index Maps for NTTs and the 
Agarwal-Burrus NTT 



For NTTs, in general the transform length $N$ is proportional to the bit width 
$b$. This constraint makes it impossible to build long (one-dimensional) 
transforms, because the necessary bit width will be tremendous. It is possible to try 
the multidimensional index maps, called Good-Thomas and Cooley-Tukey, 
which we discussed in Chap. 6. If these methods are applied to NTTs, the 
following problems arise: 

• In Cooley-Tukey algorithms of length $N = N_1 N_2$, an element of order $N$ 
in the twiddle factors is needed. It follows that the transform length is not 
increased, compared with the one-dimensional case, and will result in large 
bit width. It is therefore not attractive. 

• If Good-Thomas mapping is applied, there is no need for an element of 
order $N$, for a length $N = N_1 N_2$ transform. However, two coprime length 
transforms $N_1$ and $N_2$ are needed for the same $M$. That is impossible for 
NTTs, if the transforms listed in Table 7.1 (p. 296) are used. The only 
way to make Good-Thomas NTTs work is to use different extension fields, 
as reported in [137], or to use them in combination with Winograd 
short-convolution algorithms, but this will also increase the complexity of the 
implementation. 



An alternative method suggested by Agarwal and Burrus [150] seems to be 
more attractive. In the Agarwal-Burrus algorithm, a one-dimensional array is 
also first mapped into a two-dimensional array, but in contrast to the 
Good-Thomas method, the lengths $N_1$ and $N_2$ need not be coprime. The 
Agarwal-Burrus algorithm can be understood as a generalization of the overlap-save 
method, where periodic extensions of the signals are built. If an $\alpha$ of order 
$2L$ is used, a convolution of size 

$$N = 2L^2 \tag{7.18}$$

can be built. From Table 7.2, it can be seen that this two-dimensional method 
improves the maximum length of the transforms. 



Table 7.2. Data for some Agarwal-Burrus NTTs, to compute cyclic convolution 
using real Fermat NTTs ($b = 2^t$, $t = 0$ to 4) or pseudo-Fermat NTTs ($t = 5, 6$). 

    Module      alpha      1D      2D
    2^b + 1     2          2b      2b^2
    2^b + 1     sqrt(2)    4b      8b^2

To compute the Agarwal-Burrus NTT, the following five steps are used: 
Algorithm 7.9: Agarwal-Burrus NTT 

The cyclic convolution $x \circledast h$ of length $N = 2L^2$, with an NTT of 
length $2L$, is accomplished with the following steps: 

1) Index transformation of the one-dimensional sequences into 
two-dimensional arrays according to 

$$x = \begin{bmatrix} x[0] & x[L] & \cdots & x[N-L] \\ x[1] & x[L+1] & \cdots & x[N-L+1] \\ \vdots & \vdots & & \vdots \\ x[L-1] & x[2L-1] & \cdots & x[N-1] \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \tag{7.19}$$

$$h = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ h[N-L+1] & h[1] & \cdots & h[N-2L+1] \\ \vdots & \vdots & & \vdots \\ h[N-1] & h[L-1] & \cdots & h[N-L-1] \\ h[0] & h[L] & \cdots & h[N-L] \\ h[1] & h[L+1] & \cdots & h[N-L+1] \\ \vdots & \vdots & & \vdots \\ h[L-1] & h[2L-1] & \cdots & h[N-1] \end{bmatrix} \tag{7.20}$$

2) Computation of the row transforms of $x$ and $h$, followed by the column 
transforms, which gives $X$ and $H$. 

3) Computation of the element-by-element matrix multiplication, $Y = H \odot X$. 

4) Inverse transforms of the columns of $Y$, followed by the inverse row 
transforms. 

5) Reconstruction of the output sequence from the lower part of $y$, 
according to 

$$y = \begin{bmatrix} \vdots & \vdots & & \vdots \\ y[0] & y[L] & \cdots & y[N-L] \\ y[1] & y[L+1] & \cdots & y[N-L+1] \\ \vdots & \vdots & & \vdots \\ y[L-1] & y[2L-1] & \cdots & y[N-1] \end{bmatrix} \tag{7.21}$$

The Agarwal-Burrus NTT can be demonstrated with the following example: 
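Example 7.10: Length 8 Agarwal-Burrus NTT 

An NTT modulo 257 of length 4 exists for $\alpha = 16$. Compute the convolution 
of $X(z) = 1 + z^{-1} + z^{-2} + z^{-3}$ with $F(z) = 1 + 2z^{-1} + 3z^{-2} + 4z^{-3}$ using a 
Fermat NTT modulo 257. 

Solution: First, the index maps and transforms of $x[n]$ and $f[n]$ are computed: 

$$x = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \longleftrightarrow X = \begin{bmatrix} 4 & 34 & 0 & 227 \\ 34 & 32 & 0 & 2 \\ 0 & 0 & 0 & 0 \\ 227 & 2 & 0 & 225 \end{bmatrix} \tag{7.22}$$

$$f = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 2 & 4 & 0 \\ 1 & 3 & 0 & 0 \\ 2 & 4 & 0 & 0 \end{bmatrix} \longleftrightarrow F = \begin{bmatrix} 16 & 143 & 255 & 112 \\ 253 & 114 & 66 & 206 \\ 249 & 212 & 255 & 51 \\ 253 & 45 & 195 & 145 \end{bmatrix} \tag{7.23}$$

Now, an element-by-element multiplication is computed, which results in 

$$Y = \begin{bmatrix} 64 & 236 & 0 & 238 \\ 121 & 50 & 0 & 155 \\ 0 & 0 & 0 & 0 \\ 120 & 90 & 0 & 243 \end{bmatrix} \longleftrightarrow y = \begin{bmatrix} 2 & 6 & 4 & 0 \\ 0 & 2 & 6 & 4 \\ 1 & 6 & 9 & 4 \\ 3 & 10 & 7 & 0 \end{bmatrix} \tag{7.24}$$

From the lower half of $y$, the elements of $y[n] = \{1, 3, 6, 10, 9, 7, 4, 0\}$ can be 
read off. 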
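The steps of the Agarwal-Burrus scheme can be traced in software for the length-8 case. The following Python sketch is an illustration under stated assumptions: it fixes $M = 257$ and $\alpha = 16$ (order $4 = 2L$, so only $L = 2$ is supported), leaves the unused first row of the $h$ array at zero, and evaluates the NTTs directly instead of with a fast algorithm. The function names are hypothetical, and `pow` with a negative exponent needs Python 3.8 or later.

```python
M, ALPHA = 257, 16            # alpha has order 4 = 2L modulo 257, i.e., L = 2

def ntt(v, root):
    """Direct length-len(v) NTT modulo M."""
    n = len(v)
    return [sum(v[i] * pow(root, i * k, M) for i in range(n)) % M
            for k in range(n)]

def ntt2d(a, root):
    """Row transforms followed by column transforms."""
    rows = [ntt(r, root) for r in a]
    cols = [ntt(list(c), root) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def agarwal_burrus(x, h, L=2):
    """Length N = 2*L*L cyclic convolution via a (2L x 2L) two-dimensional NTT."""
    N, n = 2 * L * L, 2 * L
    # index maps: data in the upper L rows of x, shifted copies of h below row 0
    x2d = [[x[c * L + r] if r < L else 0 for c in range(n)] for r in range(n)]
    h2d = [[h[(c * L + r - L) % N] if r >= 1 else 0 for c in range(n)]
           for r in range(n)]
    X, H = ntt2d(x2d, ALPHA), ntt2d(h2d, ALPHA)
    Y = [[(X[r][c] * H[r][c]) % M for c in range(n)] for r in range(n)]
    scale = pow(n * n, -1, M)                       # (2L)^-2 for the 2D inverse
    y2d = [[(v * scale) % M for v in row]
           for row in ntt2d(Y, pow(ALPHA, -1, M))]
    # output sequence is read column by column from the lower L rows
    return [y2d[L + r][c] for c in range(n) for r in range(L)]

print(agarwal_burrus([1, 1, 1, 1, 0, 0, 0, 0], [1, 2, 3, 4, 0, 0, 0, 0]))
```

Only the lower $L$ rows of the inverse transform are used; the upper rows contain wrapped-around terms and are discarded, much as in the overlap-save method.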



With the Agarwal-Burrus NTT, a double-size intermediate memory is 
needed, but much longer transforms can be computed. The two-dimensional 
principle can easily be extended to three-dimensional index maps, but most 
often the transform length achieved with the two-dimensional method will be 
sufficient. For instance, for $\alpha = 2$ and $b = 32$, the transform length is increased 
from 64 in the one-dimensional case to $2^{11} = 2048$ in the two-dimensional 
case. 

7.1.5 Computing the DFT Matrix with NTTs 

Most often DFTs and NTTs are used to compute convolution, and it can be 
attractive to use NTTs to compute this convolution with FPGAs, because 
1C and D1 can be efficiently implemented. But sometimes it is necessary to 
compute the DFT to estimate the Fourier spectrum. Then a question arises: Is 
it possible to use the more efficient NTT to compute the DFT? This question 
has been addressed in detail by Siu and Constantinides [151]. 

The idea is as follows: For prime p-length DFTs, the Rader algorithm can 
be used, which converts the task into a length p — 1 cyclic convolution. This 
cyclic convolution is then computed by an NTT of the original sequence and 
the DFT twiddle factors in the NTT domain, multiplication elementwise, and 
the back conversion. These processing steps are illustrated in Fig. 7.1b. The 
principle is demonstrated in the following example. 
Table 7.3. Building blocks to compute DFT with Fermat NTT. 

    DFT       Number                          alpha                                  Number of real
    length    ring                                                                   Mul.     Shift-Add.
    3         F_1, F_2, F_3, F_4, F_5, F_6    2^2, 2^4, 2^8, 2^16, 2^32, 2^64        2        6
    5         F_1, F_2, F_3, F_4, F_5, F_6    2, 2^2, 2^4, 2^8, 2^16, 2^32           4        20
    17        F_3, F_4, F_5, F_6              2, 2^2, 2^4, 2^8                       16       144
    257       F_6                             sqrt(2)                                256      4544

    13        F_1, F_2, F_3, F_4, F_5, F_6    2, 2^2, 2^4, 2^8, 2^16, 2^32           16       104
    97        F_4, F_5, F_6                   2, 2^2, 2^8, 2^4                       128      1408
    193       F_5, F_6                        2, 2^2                                 256      3200
    769       F_6                             sqrt(2)                                1024     16448

Example 7.11: Rader algorithm for N = 5 

For $N = 5$, a generator is $g = 2$, which gives the following index map, 
$\{2^0, 2^1, 2^2, 2^3\} \bmod 5 = \{1, 2, 4, 3\}$. First, the DC component is computed 
with 

$$X[0] = \sum_{n=0}^{4} x[n] = x[0] + x[1] + x[2] + x[3] + x[4]$$

and in the second step, $X[k] - x[0]$, the cyclic convolution 

$$\{x[1], x[2], x[4], x[3]\} \circledast \{W_5^1, W_5^3, W_5^4, W_5^2\}.$$

Now the NTT is applied to the (reordered) sequences $x[n]$ and $W_5^k$, as shown 
in Example 7.7 (p. 295). The transformed sequences are then multiplied 
element by element, in the NTT domain, and finally the INTT is computed. 

| 7.11 | 
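The index permutation of the Rader algorithm can be verified numerically. The Python sketch below is an illustration only: it evaluates the length-4 cyclic convolution directly in complex floating-point arithmetic rather than with an NTT, and it assumes that the twiddle sequence is ordered by the inverse powers of the generator $g = 2$. The function names are hypothetical, and `pow` with a negative exponent needs Python 3.8 or later.

```python
import cmath

N, G = 5, 2                                  # prime length and generator
W = cmath.exp(-2j * cmath.pi / N)            # DFT twiddle factor W_5

def dft(x):
    """Reference DFT by direct summation."""
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

def rader_dft(x):
    """Length-5 DFT via a length-4 cyclic convolution (Rader)."""
    perm = [pow(G, m, N) for m in range(N - 1)]        # 1, 2, 4, 3
    inv_perm = [pow(G, -m, N) for m in range(N - 1)]   # 1, 3, 4, 2
    a = [x[n] for n in perm]                           # reordered input
    b = [W ** e for e in inv_perm]                     # reordered twiddles
    c = [sum(a[m] * b[(q - m) % (N - 1)] for m in range(N - 1))
         for q in range(N - 1)]
    X = [0j] * N
    X[0] = sum(x)                                      # DC component
    for q in range(N - 1):
        X[inv_perm[q]] = x[0] + c[q]                   # outputs arrive permuted
    return X

x = [1, 2, 3, 4, 5]
err = max(abs(u - v) for u, v in zip(rader_dft(x), dft(x)))
print(err < 1e-9)
```

The outputs of the cyclic convolution arrive in the order $X[1], X[3], X[4], X[2]$, so a final reordering is needed.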



For Mersenne NTTs a problem arises, in that the NTT itself is of prime 
length, and therefore the length increased by one cannot be of prime length. 
But for a Fermat NTT, the length is $2^t$, since $M = 2^t + 1$, which is a prime. 
Siu and Constantinides found eight such short-length DFT building blocks 
to be useful. These basic building blocks are summarized in Table 7.3. 

The first part of Table 7.3 shows blocks that do not need an index trans- 
form. In the second part are listed the building blocks that have two coprime 
factors. They are 13 - 1 = 3 x 4, 97 - 1 = 3 x 32, 193 - 1 = 3 x 64, and 
769 — 1 = 3 x 256. The disadvantage of the two-factor case is that, in a 
two-dimensional index map, for only one dimension every second transform 
of the twiddle factor becomes zero. 

In the multidimensional map, it is also possible to implement a radix-2 
FFT-like algorithm, or to combine Fermat NTTs with other NTT algorithms, 
such as the (pseudo-) Fermat NTT transform, (pseudo-) Mersenne transform, 
Lagrange interpolation, Eisenstein NTTs or a short convolution algorithm 
such as the Winograd algorithm [56, 151]. 
[Figure: panel (a) computes $X[0] = \sum_n x[n]$ and, for $k \in [1, N-1]$, $X[k]$ from the cyclic convolution $x[\langle g^n \rangle_N] \circledast W_N^{\langle g^n \rangle_N}$ plus $x[0]$.] 

Fig. 7.1. The use of NTTs in Rader’s prime-length algorithm for computing the 
DFT. (a) Rader’s original algorithm. (b) Modification of the Rader prime algorithm 
using NTTs. 



In the following, the techniques for the two-factor case using a length 
13— 1 = 3x4 multidimensional index map are reviewed. This is similar to 
the discussion in Chap. 6 for FFTs. 

7.1.6 Index Maps for NTTs 

To directly realize the NTT matrix is generally too expensive. This problem 
may be resolved by suitable multidimensional techniques. Burrus [111] 
gives a systematic overview of different common and prime factor maps, 
from one dimension to multiple dimensions. The mapping is explained for 
the two-dimensional case. Higher-order mapping is equivalent. The mapping 
from the one-dimensional cyclic length-$N$ convolution from (7.1), into a 
two-dimensional convolution with dimension $N = N_1 \times N_2$, can be written in 
linear form as follows: 

$$n = M_1 n_1 + M_2 n_2 \bmod N, \tag{7.25}$$

where $n_1 \in \{0, 1, 2, \ldots, N_1 - 1\}$ and $n_2 \in \{0, 1, 2, \ldots, N_2 - 1\}$. For $\gcd(N_1, N_2) 
\neq 1$, the well-known Cooley-Tukey FFT algorithm may be used. Burrus [111] 
shows that the map is cyclic in both dimensions if and only if $N_1$ and $N_2$ are 
relatively prime, i.e., $\gcd(N_1, N_2) = 1$. In order for this map to be one-to-one 
and onto (i.e., a bijection), the mapping constants $M_1$ and $M_2$ must satisfy 
certain conditions. For the relatively prime case, the conditions to make the 
mapping bijective are: 

$$[M_1 = \beta N_2 \text{ and/or } M_2 = \gamma N_1] \text{ and } \gcd(M_1, N_1) = \gcd(M_2, N_2) = 1. \tag{7.26}$$

As an example, consider $N_1 = 3$ and $N_2 = 4$, $N = 12$. From 
condition (7.26) we see that it is necessary to choose $M_1$ (a multiple of $N_2$), 
or $M_2$ (a multiple of $N_1$), or both. Make $M_1$ the simplest multiple of $N_2$, 
i.e., $M_1 = N_2 = 4$, which also satisfies $\gcd(M_1, N_1) = \gcd(4, 3) = 1$. Then, 
noting that $\gcd(M_2, N_2) = \gcd(M_2, 4) = 1$, the possible values for $M_2$ are 
$\{1, 3, 5, 7, 9, 11\}$. As a simple choice, select $M_2 = N_1 = 3$. The map becomes 
$n = \langle 4n_1 + 3n_2 \rangle_{12}$. Now let us apply the map to consider a 12-point 
convolution example. The transform of the one-dimensional cyclic array $x[n]$ into 
a $3 \times 4$ two-dimensional array $x[n_1, n_2]$ produces 

$$[x[0]\ x[1]\ x[2]\ \ldots\ x[11]] \longleftrightarrow \begin{bmatrix} x[0] & x[3] & x[6] & x[9] \\ x[4] & x[7] & x[10] & x[1] \\ x[8] & x[11] & x[2] & x[5] \end{bmatrix} \tag{7.27}$$

To recover the sequence $X[k]$ from the $X[k_1, k_2]$, use the Chinese remainder 
theorem, as suggested by Good [134], 

$$k = \left\langle (N_2^{-1} \bmod N_1) N_2 k_1 + (N_1^{-1} \bmod N_2) N_1 k_2 \right\rangle_N. \tag{7.28}$$

The $\alpha$ matrix can now be rewritten as 

$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1} \left( \sum_{n_2=0}^{N_2-1} x[n_1, n_2]\, \alpha_{N_2}^{n_2 k_2} \right) \alpha_{N_1}^{n_1 k_1}, \tag{7.29}$$

where $\alpha_{N_i}$ is an element of order $N_i$. Having mapped the original sequence 
$x[n]$ into the two-dimensional array $x[n_1, n_2]$, the desired matrix can be 
evaluated by the following two steps: 

1) Perform an $N_2$-point NTT on each row of the matrix $x[n_1, n_2]$. 

2) Perform an $N_1$-point NTT on each column of the resultant matrix, to 
yield $X[k_1, k_2]$. 
Fig. 7.2. Two-dimensional map. First stage: three 4-point NTTs. Second stage: 
four 3-point NTTs. 



These processing steps are shown in Fig. 7.2. The input map is given by 
(7.27), while the output map can be computed with (7.28), 

$$k = \left\langle (4^{-1})_3\, 4 k_1 + (3^{-1})_4\, 3 k_2 \right\rangle_{12} = \langle 4k_1 + 9k_2 \rangle_{12}. \tag{7.30}$$

The array $X[k_1, k_2]$ will therefore have the following arrangement: 

$$[X[0]\ X[1]\ X[2]\ \ldots\ X[11]] \longleftrightarrow \begin{bmatrix} X[0] & X[9] & X[6] & X[3] \\ X[4] & X[1] & X[10] & X[7] \\ X[8] & X[5] & X[2] & X[11] \end{bmatrix} \tag{7.31}$$
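The input map (7.27) and output map (7.31) can be generated and tested in a few lines. The Python sketch below additionally checks, for a toy case, that the row-column NTT with these maps reproduces the one-dimensional NTT. The modulus $M = 13$ with $\alpha = 2$ (an element of order 12) is an assumption made only to keep the numbers small, and the function names are hypothetical.

```python
M, ALPHA = 13, 2              # assumed toy modulus; 2 has order 12 modulo 13
N1, N2 = 3, 4
N = N1 * N2

def in_map(n1, n2):
    """Input index map n = <4*n1 + 3*n2>_12."""
    return (4 * n1 + 3 * n2) % N

def out_map(k1, k2):
    """Output (CRT) index map k = <4*k1 + 9*k2>_12, see (7.30)."""
    return (4 * k1 + 9 * k2) % N

def ntt_direct(x):
    """One-dimensional length-12 NTT by direct summation."""
    return [sum(x[n] * pow(ALPHA, n * k, M) for n in range(N)) % M
            for k in range(N)]

def ntt_2d(x):
    """Same transform via 4-point row NTTs and 3-point column NTTs."""
    a1 = pow(ALPHA, N2, M)    # element of order N1 = 3
    a2 = pow(ALPHA, N1, M)    # element of order N2 = 4
    x2d = [[x[in_map(n1, n2)] for n2 in range(N2)] for n1 in range(N1)]
    rows = [[sum(x2d[n1][n2] * pow(a2, n2 * k2, M) for n2 in range(N2)) % M
             for k2 in range(N2)] for n1 in range(N1)]
    X = [0] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[out_map(k1, k2)] = sum(rows[n1][k2] * pow(a1, n1 * k1, M)
                                     for n1 in range(N1)) % M
    return X

print([[in_map(n1, n2) for n2 in range(N2)] for n1 in range(N1)])
print([[out_map(k1, k2) for k2 in range(N2)] for k1 in range(N1)])
```

No twiddle factors appear between the two stages, because the cross terms of $nk$ are multiples of 12 for this pair of maps.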



Length 97 DFT case study. In recent years programmable digital sig- 
nal processors (e.g., TMS320; Motorola 56K; AT&T 32C) have become the 
dominant vehicle to implement fast convolution via FFT algorithms. These 
PDSPs provide a fast (real) multiplier with typical cycle times of 10 to 50 ns. 
There are also some NTT implementations [152], but NTT implementations 
need modulo arithmetic, which is not supported by general-purpose PDSPs. 
Dedicated accelerators, such as the FNT from McClellan [152], use 90 stan- 
dard ECL 10K ICs. In recent years, field-programmable gate arrays (FPGAs) 
have become dense enough and fast enough to implement typical high-speed 
DSP applications [4, 121]. It is possible to implement several arithmetic cores 
with only one FPGA, producing good packaging, speed, and power character- 
istics. FPGAs, with their fine granularity, can implement modulo arithmetic 
efficiently, without penalty, as in the PDSP case. 

In NTT implementation of Fermat number arithmetic, the previously dis- 
cussed speed and hardware advantages, compared with conventional FFT im- 
plementations, become an even bigger advantage for an FPGA implementa- 
tion. By implementing the DFT algorithm with the Rader prime convolution 
strategy, the required I/O performance can be further reduced. 

To clarify the NTT design paradigm, a length-97 DFT in the Fermat 
number systems $F_4$ and $F_5$, for real input data, will be shown. A Xilinx XC4K 
multi-FPGA board has been used to implement this design, as reported in 
[137]. 
For modulo Fermat number arithmetic (modulo $2^n + 1$) it is advantageous 
to use, instead of the usual two’s complement arithmetic (2C), the 
“diminished-one” (D1) number system from Leibowitz [147]. Negative numbers are 
the same as in 2C, and positive numbers are diminished by one. The zero is 
encoded as a zero string and the MSB “ZERO-FLAG” is one. Therefore the 
diminished system consists of a ZERO-FLAG and integer bits $x_k$. For 2C, 
the MSB is the sign bit, while for D1 the second MSB is the sign bit. With 
this encoding the basic operations of 2C↔D1 conversion, negation, addition, 
and multiplication by $2^m$ can easily be determined, as shown in Sect. 7.1.1 
(p. 291). 

The rough processing steps of the 97-point transform are shown in 
Fig. 7.1b. A direct length-96 implementation for a single NTT will cost at 
least 96 × 2 barrel shifters and 96 × 2 accumulators and, therefore, 
approximately 96(2 × 32 + 2 × 18) = 9600 Xilinx combinatorial logic blocks (CLBs). 
Therefore it seemed reasonable to use a 32 × 3 index map, as described in 
the last section. The length-32 FFT now becomes a simpler length-32 Fermat 
NTT, and the length-3 transform has $\alpha^k = 1$, $j$, and $-1 - j$ with $j^2 = -j - 1$. 
The 32-point FNT can be realized with the usual radix-2 FFT-type 
algorithm, while the length-3 transform can be implemented by a two-tap FIR 
filter. The following table gives CLB utilization estimates for Xilinx XC4000 
FPGAs, for $F_4$: 

    Length-32    Length-3    14 Multipliers    Length-3     Two length-32
    FNT          FIR NTT     32-bit            NTTs^-1      FNT^-1
    104          108         462               288          216

The design consumes a total of 1178 CLBs. To get high throughput, the 
buffer memory between the blocks must be doubled. Two real buffers for the 
first FNT, and three complex buffers, are required. If the buffers are realized 
internally, an additional 748 CLBs are required, which will also minimize the 
I/O requirements. If 80% utilization is assumed, then about six XC4010s are 
needed for the design, including the buffer memory. 

The time-critical path in the design is the length-32 FNT. To maximize 
the throughput, a three-stage pipeline is used inside the butterfly. For a 5-ns 
FPGA, the butterfly speed is 28 ns for $F_4$, and 38.5 ns for $F_5$. For three 
length-32 FNTs, five stages, each with 16 butterflies, must be computed. This gives 
a total transform time of 7.15 µs for $F_4$, and 9.24 µs for $F_5$, for the length-97 
DFT. To set this result in perspective, the time for the butterfly computation 
gives a fair comparison. A TMS320C50 PDSP with a 50-ns cycle time needs 
17 cycles for a butterfly [153], or 850 ns, assuming zero wait-state memory. 
Another “conventional” FPGA fixed-point arithmetic design [121] uses four 
serial/parallel multipliers (2 MHz), and therefore has a latency of 500 ns for 
the butterfly. 
Fig. 7.3. DFT computation using rectangular transform and map matrix T. 



7.1.7 Using Rectangular Transforms to Compute the DFT 

Rectangular transforms also map an input sequence into an image domain, but 
do not necessarily have the DFT structure, i.e., $A = [\alpha^{nk}]$. Examples are Haar 
transforms [93], the Walsh-Hadamard transform [61], a rough-quantized DFT 
[5], or the arithmetic Fourier transform [136, 154, 155]. These rectangular 
transforms, in general, do not support cyclic convolution, but they may be 
used to approximate the DFT spectrum [128]. The advantage of rectangular 
transforms is that the coefficients are from the set $\{-1, 0, 1\}$ and they do not 
need any multiplications. 

How to compute the DFT is shown in Fig. 7.3. In order to have a useful 
system, it is assumed that the rectangular transform can be computed with 
low effort, and the second transform using the matrix T, which maps the 
rectangular transform to the DFT vectors, has only a few nonzero elements. 



Table 7.4. Comparison of different transforms to approximate the DFT [5]. 

Transform    Number of base    Algorithmic        Zeros in 
             functions         complexity         16 x 16 T matrix 
Walsh        $N$               $N \log_2(N)$      66 
Hadamard     $N$               $N \log_2(N)$      66 
Haar         $N$               $2N$               18 
AFT          $N+1$             $N^2$              82 
QDFT         $2N$              $(N/8)^2 + 3N$     86 



Table 7.4 compares different implementations. The algorithmic complexity 
of the Walsh-Hadamard and Haar transforms is most interesting, but 
from the number of zeros in the second transform T it can be concluded 
that the arithmetic Fourier transform and the rough-quantized DFT are more 
attractive for approximating the DFT. 


7.2 Error Control and Cryptography 

Modern communications systems, such as pagers, mobile phones or satellite 
transmission systems, use algorithms to correct transmission errors, since 
error-correction coding better utilizes the band-limited channel capacity than 
special modulation schemes (see Fig. 7.4). In addition, most systems also use 
cryptography algorithms, not just to protect messages against unauthorized 
listeners, but also to protect messages against unauthorized changes. 

In a typical transmission scheme, such as that shown in Fig. 7.5, the 
encoder (for error correction or cryptography) is placed between the data 
source and the actual modulation. On the receiver side, the decoder is located 
between demodulation and the data destination (sink). Often an encoder and 
decoder are combined in one circuit, referred to as a CODEC. 

Typical error correction and cryptographic algorithms use finite field 
arithmetic and are therefore more suitable for FPGAs than they are for 
PDSPs [157]. Bitwise operations or linear feedback shift registers (LFSR) 
can be very efficiently realized with FPGAs. Some CODEC schemes use 
large tables, and one objective when selecting the appropriate algorithms 
for FPGAs is therefore to find out which algorithms are most suitable. The 




Fig. 7.4. Performance of modulation schemes [156]. Solid line: coherent 
demodulation; dashed line: incoherent demodulation. 

Fig. 7.5. Typical communications system configuration. 



algorithms presented in this section are mainly based on previous publica- 
tions [4] and have been used to develop a paging system for low frequencies 
[158, 159, 160, 161, 162], and an error-correction scheme for radio-controlled 
watches [163, 164]. 

It is impossible in a short section to present the whole theory of error 
correction and cryptography. We will present the basic ideas and suggest, for 
further investigation, one of the excellent textbooks in this area [126, 165, 
166, 167, 168, 169, 170]. 



7.2.1 Basic Concepts from Coding Theory 

The simplest way to protect a digital transmission against random errors 
is to repeat the message several times. This is called a repetition code. For a 
repetition of 5, for instance, the message is sent five times, i.e., 

$0 \rightarrow 00000$ (7.32) 

$1 \rightarrow 11111$, (7.33) 

where the left side shows the k information bits and the right side the n-bit 
codewords. The minimum distance between two codewords, also called 
the Hamming distance $d^*$, is also n and the repetition code is of the form 
$(n, k, d^*) = (5, 1, 5)$. With such a code it is possible to correct up to 
$\lfloor (n-1)/2 \rfloor$ random errors. But from the perspective of channel efficiency, 
this code is not very attractive. If our system is two-way then it is more 
efficient to use a technique such as a parity check and an automatic repeat 
request (ARQ) for any detected parity error. Such parity checks are used, for 
instance, in PC memory. 
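
The (5,1,5) repetition code can be sketched in a few lines of Python. The decoder takes a majority vote over each block of n bits and therefore corrects up to ⌊(n − 1)/2⌋ = 2 errors per block. The function names are illustrative only.

```python
def rep_encode(bits, n=5):
    """Repeat every information bit n times."""
    return [b for b in bits for _ in range(n)]

def rep_decode(code, n=5):
    """Majority vote over each block of n received bits (n is assumed odd)."""
    return [1 if sum(code[i:i + n]) > n // 2 else 0
            for i in range(0, len(code), n)]
```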

Error correction using a Hamming code. If a few more parity check 
bits are added, it is possible to correct a word with a parity error. 

If the parities $P_{1,0}$, $P_{1,1}$, $P_{1,2}$, and $P_{1,3}$ are computed using modulo 2 
operations, i.e., XOR, according to 

$P_{1,0} = i_{21} \oplus i_{22} \oplus i_{23} \oplus i_{24} \oplus i_{25} \oplus i_{26} \oplus i_{27}$ 
$P_{1,1} = i_{21} \oplus i_{23} \oplus i_{25} \oplus i_{27}$ 
$P_{1,2} = i_{21} \oplus i_{22} \oplus i_{25} \oplus i_{26}$ 
$P_{1,3} = i_{21} \oplus i_{22} \oplus i_{23} \oplus i_{24}$ 

then the parity detector is $P_{1,0}$, and three additional bits are necessary 
to locate the error position. Figure 7.6a shows the encoder, and Fig. 7.6b 
the decoder including the correction logic. On the decoder side the incoming 
parities are XOR’d with the newly computed parities. This forms the so-called 
syndrome ($S_{1,0} \cdots S_{1,3}$). The parities have been chosen in such a way 
that the syndrome pattern corresponds to the position of the bit in binary 
code, i.e., a 3 → 7 demultiplexer can be used to decode the error location. 
For a more compact representation of the decoder, the following parity 
check matrix H can be used 

$H = \begin{bmatrix} 1&1&1&1&1&1&1&1&0&0&0 \\ 1&0&1&0&1&0&1&0&1&0&0 \\ 1&1&0&0&1&1&0&0&0&1&0 \\ 1&1&1&1&0&0&0&0&0&0&1 \end{bmatrix}$ (7.34) 

Table 7.5. Estimated effort for error correction with Hamming code. 

Block                     Minutes     Hours       Date 
Hamming code              (11,7,3)    (10,6,3)    (27,22,3) 
CLB effort for 
  Register                6           6           14 
  Syndrome computation    5           5           16 
  Correction logic        4           4           22 
  Output register         4           4           11 
Sum                       19          19          63 
Total                                 101 






It is possible to describe the encoder using a generator matrix G = [I:P], i.e., 
the generator matrix consists of a systematic identity matrix I followed by 
the parity-bits matrix P. A codeword v is computed by multiplying (modulo 
2) the information word i with the generator matrix G: 

$v = i \times G$. (7.35) 



The (de)coders shown in Fig. 7.6 are those for a (11,7,3) Hamming code, 
and it is possible to detect and correct one error. In general, it can be shown 
that for 4 parity bits, up to 11 information bits can be used, i.e., a (15,11,3) 
Hamming code has been shortened to a (11,7,3) code. 
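
A minimal Python sketch of the (11,7,3) code follows. It builds G = [I:P] from the parity check matrix of (7.34), encodes with v = i × G (modulo 2), and corrects a single error by matching the syndrome against the columns of H. The column search stands in for the demultiplexer of the hardware decoder; all names are illustrative.

```python
# Rows of H from (7.34): 7 information columns followed by 4 parity columns.
H = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
]
K, N = 7, 11

# G = [I : P]; row r of P is column r of the information part of H.
G = [[int(r == c) for c in range(K)] + [H[row][r] for row in range(N - K)]
     for r in range(K)]

def encode(info):
    """Codeword v = i x G (modulo 2)."""
    return [sum(info[r] * G[r][c] for r in range(K)) % 2 for c in range(N)]

def syndrome(word):
    """Syndrome S = H x v^T (modulo 2); all zero for a valid codeword."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

def correct(word):
    """Flip the single bit whose column of H equals the syndrome, if any."""
    s = syndrome(word)
    fixed = list(word)
    if any(s):
        for c in range(N):
            if [row[c] for row in H] == s:
                fixed[c] ^= 1
                break
    return fixed
```

Because all eleven columns of H are distinct and nonzero, every single-bit error produces a unique syndrome.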

A Hamming code with distance 3 generally has a $(2^m - 1, 2^m - m - 1, 3)$ 
structure. The dates in radio-controlled watches, for instance, are coded with 
22 bits, and a (31,26,3) Hamming code can be shortened to a (27,22,3) code 
to achieve a single-error correcting code. The parity check matrix becomes: 

$H = \begin{bmatrix} 101010101010101010101010000 \\ 110011001100110011001101000 \\ 111100001111000011110000100 \\ 111111110000000011111100010 \\ 111111111111111100000000001 \end{bmatrix}$. 

Again, the syndromes can be sorted in such a way that the correction logic 
is a simple 5 → 22 demultiplexer. 

Table 7.5 shows the estimated effort in CLBs using Xilinx XC3K FP- 
GAs for an error-correction unit for radio-controlled watches that uses three 
separate data blocks for minutes, hours, and date. 

In conclusion, with an additional 3+3+5=11 bits and the parity bits for 
the minutes using about 100 CLBs, it is possible to correct one error in each 
of the three blocks. 



Survey of Error Correction Codes 

After the introductory case study in the last section, commonly used codes 
and possible encoder and decoder implementations will be discussed next. 

Fig. 7.7. Decoder for error correction. 

Most often the effort for the decoder is of greater concern, since many 
communications systems such as pagers or radios use one sender and several 
receivers. Fig. 7.7 shows a diagram of possible decoders. 

Some nearly optimal decoders use huge tables and are not included in 
Fig. 7.7. The difference between block and convolutional codes is based on 
whether “memory” is used in the code generation. Both methods are 
characterized by the code rate R, which is the quotient of the information bits and 
the code length, i.e., R = k/n. For tree codes with memory, the actual output 
block, which is n bits long, depends not only on the present k information 
bits, but also on the previous m symbols, as shown in Fig. 7.8. Characteristics 
of convolutional codes are the memory length $\nu = m \times k$, as well as the 
distance profile, the free distance $d_f$, and the minimum distance $d_m$ (see, for 
instance, [126]). Block codes can most often be constructed with algebraic 
methods using Galois fields, but tree codes are often only found in computer 
simulations. 

Our discussion will be limited to linear codes, i.e., codes where the sum of 
two codewords is again a codeword, because this simplifies the decoder 
implementation. For linear codes, the Hamming distance can always be computed 
as the difference between a codeword and the zero word, which simplifies 
comparisons of the performance of the code. Linear tree codes are often called 
convolutional codes, because the codes can be built using an FIR-like 
structure. Convolutional codes may be catastrophic or noncatastrophic. In the case 
of a catastrophic code, a single error will be propagated forever. It can be 
shown that systematic convolutional codes are always noncatastrophic. It is 
also common to distinguish between random error correction codes and burst 
error correction codes. In burst error correction, there may be a long burst 
of errors (or erasures). With a random error correction code, the capability to 
correct errors is not limited to consecutive bits; the error may have a random 
position in the received codeword. 

Fig. 7.8. Parameters of the convolutional encoders. 

Coding bounds. With coding bounds we can compare different coding 
schemes. The bounds show the maximum error correction capability of the 
code. A decoder can never be better than the upper bound of the code, and 
sometimes to reduce the complexity of the decoder it is necessary to decode 
less than the theoretical bound. 

A simple but still good, rough estimation is the Singleton bound or the 
Hamming bound. The Singleton bound states that the minimum Hamming 
distance $d^*$ is upper bounded by the number of parity bits plus one 
($n - k + 1$). It is also known [126, p. 256] that the number of correctable 
errors t and the number of erasures e for a code is upper bounded by the 
Hamming distance. This gives the following bounds: 

$e + 2t + 1 \le d^* \le n - k + 1$. (7.36) 

A code with $d^* = n - k + 1$ is called maximum distance separable, but besides 
the repetition code and the parity check code, there are no binary maximum 
distance separable codes [126, p. 431]. Following the example in the last 
section from Table 7.5, with 11 parity bits the upper bound can be used to 
correct up to five errors. 

For a t-error-correcting binary code, the following Hamming bound 
provides a good estimation: 

$2^{n-k} \ge \sum_{j=0}^{t} \binom{n}{j}$. (7.37) 

Equation (7.37) says that the possible number of parity check patterns ($2^{n-k}$) 
must be greater than or equal to the number of error patterns. If the equal 
sign is valid in (7.37), such codes are called perfect codes. A perfect code is, 
for instance, the Hamming code discussed in the last section. If it is desired, 
for instance, to find a code to protect all 44 bits transmitted in one minute for 
radio-controlled watches, using the maximum-available 13 parity bits, then 
it follows that 

$2^{13} > \binom{44}{0} + \binom{44}{1} + \binom{44}{2}$ (7.38) 

$2^{13} < \binom{44}{0} + \binom{44}{1} + \binom{44}{2} + \binom{44}{3}$, (7.39) 


i.e., it should be possible to find a code with the capability to correct two 
random errors but none with three errors. In the following sections we will 
review such block encoders and decoders, and then discuss convolutional 
encoders and decoders. 
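
The two bounds are easy to evaluate numerically. The Python sketch below returns the largest t permitted by the Singleton bound (without erasures) and by the Hamming bound. It follows the text in using n = 44 with 13 parity bits; the function names are assumptions of this sketch.

```python
from math import comb

def singleton_max_t(parity_bits):
    """Largest t with 2t + 1 <= n - k + 1 (no erasures, e = 0)."""
    return parity_bits // 2

def hamming_max_t(n, parity_bits):
    """Largest t whose error patterns of weight <= t fit into 2^(n-k) syndromes."""
    limit = 2 ** parity_bits
    total, t = 0, 0
    while t <= n and total + comb(n, t) <= limit:
        total += comb(n, t)
        t += 1
    return t - 1
```

For example, `singleton_max_t(11)` reproduces the five errors quoted for the 11 parity bits of Table 7.5, and `hamming_max_t(44, 13)` gives the two-error limit. Counting the error patterns over the full codeword length of 57 bits leads to the same limit.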



7.2.2 Block Codes 

The linear cyclic binary BCH codes (from Bose, Chaudhuri, and Hocquenghem) 
and the subclass of Reed-Solomon codes constitute a large class of 
block codes. BCH codes have various known efficient decoders in the time 
and frequency domains. In the following, we will illustrate the shortening of 
a (63,50,6) to a (57,44,6) BCH code. The algorithm is discussed in detail by 
Blahut [126, pp. 162-6]. 

The code is based on a transformation of GF($2^6$) to GF(2). To describe 
GF($2^6$), a primitive polynomial of degree 6 is needed, such as $P(x) = x^6 + x + 1$. 
To compute the generator polynomial, the least common multiple of the 
first $d - 1 = 5$ minimal polynomials in GF($2^6$) must be computed. If $\alpha$ denotes 
a primitive element in GF($2^6$), it follows then that $\alpha^0 = 1$ and $m_1(x) = x - 1$. 

The minimal polynomials of $\alpha$, $\alpha^2$, and $\alpha^4$ are identical, $m_\alpha(x) = x^6 + x + 1$, 
and the minimal polynomial of $\alpha^3$ is $m_{\alpha^3}(x) = x^6 + x^4 + x^2 + x + 1$. It is 
now possible to build the generator polynomial, g(x): 

$g(x) = m_1(x) \times m_\alpha(x) \times m_{\alpha^3}(x)$ (7.40) 

$= x^{13} + x^{12} + x^{11} + x^{10} + x^9 + x^8 + x^6 + x^3 + x + 1$. (7.41) 

Using this generator polynomial (to compute the parity bits), it is now a 
straightforward procedure to build the encoder and decoder. 
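
The product in (7.40) can be checked with a few lines of Python. Polynomials over GF(2) are stored as integers whose bit i holds the coefficient of $x^i$; this representation and the helper name `gf2_mul` are choices made for this sketch.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials stored as bit masks (carry-less product)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

M1 = 0b11              # x + 1 (identical to x - 1 over GF(2))
M_ALPHA = 0b1000011    # x^6 + x + 1
M_ALPHA3 = 0b1010111   # x^6 + x^4 + x^2 + x + 1

g = gf2_mul(gf2_mul(M1, M_ALPHA), M_ALPHA3)
exponents = [i for i in range(g.bit_length() - 1, -1, -1) if (g >> i) & 1]
```

The list `exponents` holds the powers of x with nonzero coefficients in g(x), for comparison with (7.41).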

Encoder. Since a systematic code is desired, the first codeword bits are 
identical with the information bits. The parity bits p(x) are computed by modulo 
reduction of the information bits i(x), shifted in order to get a systematic 
code, according to: 

$p(x) = i(x) \times x^{n-k} \bmod g(x)$. (7.42) 

Fig. 7.9. Encoder for (57,44,6) BCH code. 

Such a modulo reduction can be achieved with a recursive shift register as 
shown in Fig. 7.9. The circuit works as follows: In the beginning, switches 
A and B are closed and C is open. Next, the information bits are applied 
(MSB first) and directly transferred to the codeword. At the same time, the 
recursive shift register computes the parity bits. After the information bits 
are all processed, switches A and B are opened and C is closed. The parity 
bits are now shifted into the codeword. 
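
The encoder rule (7.42) can be modeled in software as follows. This is a sketch of the arithmetic only: the information word is held in one integer and reduced by long division, rather than bit-serially with the switches A, B, and C of the shift-register circuit. The names `gf2_mod` and `bch_encode` are illustrative.

```python
G_POLY = 0x3F4B    # g(x) of (7.41) as a bit mask, bit i = coefficient of x^i
N, K = 57, 44

def gf2_mod(a, g):
    """Remainder of the GF(2) polynomial a(x) divided by g(x)."""
    dg = g.bit_length() - 1
    while a.bit_length() > dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def bch_encode(info):
    """Systematic codeword: 44 information bits followed by 13 parity bits."""
    shifted = info << (N - K)                  # i(x) * x^(n-k)
    return shifted | gf2_mod(shifted, G_POLY)  # append p(x)
```

Every codeword produced this way is divisible by g(x), which is the property the decoder later exploits when it computes the syndrome.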

Decoder. The decoder is usually more complex than the encoder. A Meggitt 
decoder can be used for decoding in the time domain, and frequency 
decoding is also possible, but it needs a detailed understanding of the 
algebraic properties of BCH codes ([126, pp. 166-200], [165, pp. 81-107], [166, 
pp. 65-73]). Such frequency decoders for FPGAs are already available as 
intellectual property (IP) blocks, sometimes also called “virtual components,” 
VC (see [18, 19, 171]). 

The Meggitt decoder (shown in Fig. 7.10) is very efficient for codes with 
only a few errors to be corrected, since the decoder uses the cyclic properties 
of BCH codes. Only errors in the highest bit position are corrected and then 
a cyclic shift is computed, so that eventually all corrupted bits pass the MSB 
position and are corrected. 

In order to use a shortened code and to regain the cyclic properties of 
the codes, a forward incoupling of the received data a(x) must be computed. 
This condition can be gained for a code shortened by b bits using the condition 

$s(x) = a(x) i(x) \bmod g(x) = x^{n-k+b} i(x) \bmod g(x)$. (7.43) 

For the shortened (57,44,6) BCH code this becomes 

$a(x) = x^{63-50+6} \bmod g(x) = x^{19} \bmod g(x)$ 

$= x^{19} \bmod (x^{13} + x^{12} + x^{11} + x^{10} + x^9 + x^8 + x^6 + x^3 + x + 1)$ 

$= x^{10} + x^7 + x^6 + x^5 + x^3 + x + 1$. 
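
The reduction of $x^{19}$ modulo g(x) can be verified with the same bit-mask representation of GF(2) polynomials (bit i = coefficient of $x^i$). The helper name `gf2_mod` is an assumption of this sketch.

```python
def gf2_mod(a, g):
    """Remainder of the GF(2) polynomial a(x) divided by g(x)."""
    dg = g.bit_length() - 1
    while a.bit_length() > dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

G_POLY = 0x3F4B                       # g(x) of (7.41)
a_poly = gf2_mod(1 << 19, G_POLY)     # x^19 mod g(x)
a_exponents = [i for i in range(a_poly.bit_length() - 1, -1, -1) if (a_poly >> i) & 1]
```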

The developed code has the ability to correct two errors. If only the error in 
the MSB need be corrected, a total of $1 + \binom{56}{1} = 1 + 56 = 57$ different error 
patterns must be stored, as shown in Table 7.6. The 57 syndrome values can 
be computed through a simulation and are listed in [164, B.3]. 




Fig. 7.10. Basic blocks of the Meggitt decoder. 



Now all the building blocks are available for constructing the Meggitt 
decoder for the (57,44,6) BCH code. The decoder is shown in Fig. 7.11. 

The Meggitt decoder has two stages. In the initialization phase, the syn- 
drome is computed by processing the received bits modulo the generator 
polynomial g(x). This takes 57 cycles. In the second phase, the actual error 
correction takes place. The content of the syndrome register is compared with 
the values of the syndrome table. If an entry is found, the table delivers a 
one, otherwise it delivers a zero. This hit bit is then XOR’d with the received 
bits in the shift register. In this way, the error is removed from the shift reg- 
ister. The hit bit is also wired to the syndrome register, to remove the error 
pattern from the syndrome register. Once again the syndrome and the shift 
register are clocked, and the next correction can be done. At the end, the 
shift register should include the corrected word, while the syndrome register 
should contain the all-zero word. If the syndrome is not zero, then more than 
two errors have occurred, and these cannot be corrected with this BCH code. 
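
The correct-then-shift principle can be simulated in Python. The sketch below is a simplification under stated assumptions: the 57-bit word is treated as a length-63 cyclic word with six leading zeros, the syndrome is recomputed from scratch in every cycle instead of being updated in a syndrome register, and the hit table therefore holds 63 entries rather than the 57 of the shortened hardware design with a(x) premultiplication. All names are illustrative.

```python
N_FULL = 63
G_POLY = 0x3F4B                 # g(x) of (7.41), bit i = coefficient of x^i
MSB = 1 << (N_FULL - 1)
MASK = (1 << N_FULL) - 1

def gf2_mod(a, g):
    """Remainder of the GF(2) polynomial a(x) divided by g(x)."""
    dg = g.bit_length() - 1
    while a.bit_length() > dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

# Syndromes of all patterns with an error in the MSB and at most one more error.
HIT_TABLE = {gf2_mod(MSB, G_POLY)} | {
    gf2_mod(MSB | (1 << j), G_POLY) for j in range(N_FULL - 1)}

def meggitt_decode(received):
    """Correct up to two bit errors in a 57-bit word (zero-padded to 63 bits)."""
    word = received
    for _ in range(N_FULL):
        if gf2_mod(word, G_POLY) in HIT_TABLE:
            word ^= MSB                                   # hit: fix the bit in the MSB
        word = ((word << 1) | (word >> (N_FULL - 1))) & MASK  # cyclic shift by one
    return word
```

After 63 cyclic shifts the word is back in its original alignment, and every bit has passed the MSB position once.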



Table 7.6. Table of possible error patterns. 

No.    Error pattern 
 1     0  0  0  ...  0  0  1 
 2     0  0  0  ...  0  1  1 
 3     0  0  0  ...  1  0  1 
 ...
56     0  1  0  ...  0  0  1 
57     1  0  0  ...  0  0  1 

Fig. 7.11. Meggitt decoder for (57,44,6) BCH code. 



Our only concern for an FPGA implementation of the Meggitt decoder 
is the large number (13) of inputs for the syndrome table, because the LUTs 
of FPGAs typically have 4 to 8 inputs. It is possible to use an external 
EPROM or (for Altera Flex 10K) four 2-kbit EABs to implement a table 
of size $2^{13} \times 1$. The syndrome is wired to the address lines, which deliver a 
hit (one) for the 57 syndromes, and otherwise a zero. It is also possible to 
use the logic synthesis tool to compute the table with internal logic blocks 
on the FPGA. The Xilinx XNFOPT (used in [164]) needs 132 LUTs, each 
with $2^4 \times 2$ bits. If modern synthesizers based on binary decision diagrams 
(BDDs) [172, 173, 174] are used, this number can (at the cost of additional 
delays) be reduced to 58 LUTs with a size of $2^4 \times 2$ bits [175]. Table 7.7 shows 
the estimated effort, using Flex 10K, for the Meggitt decoder using the 
different kinds of syndrome tables. 



Table 7.7. Estimated effort for Altera FLEX devices, for the three versions of the 
Meggitt decoder based on XC3K implementations [4]. (EABs are used as $2^{11} \times 1$ 
ROMs.) 

Function group       Syndrome table 
                     Using EABs          Only LCs     BDD [175] 
Interface            36 LCs              36 LCs       36 LCs 
Syndrome table       2 LCs, 4 EABs       264 LCs      116 LCs 
64-bit FIFO          64 LCs              64 LCs       64 LCs 
Meggitt decoder      12 LCs              12 LCs       12 LCs 
State machine        21 LCs              21 LCs       21 LCs 
Total                135 LCs, 4 EABs     397 LCs      249 LCs 


7.2.3 Convolutional Codes 

We also want to explore the kind of convolutional error-correcting decoders 
that are suitable for an FPGA realization. To simplify the discussion, the 
following constraints are defined, which are typical for communications sys- 
tems: 

• The code should minimize the complexity of the decoder. Encoder com- 
plexity is of less concern. 

• The code is linear systematic. 

• The code is convolutional. 

• The code should allow random error correction. 

A systematic code is stipulated to allow a power-down mode, in which 
only incoming bits are received without error correction [163]. A random- 
error-correction code is stipulated if the channel is slow fading. 

Figure 7.12 shows a diagram of the possible tree codes, while Fig. 7.7 
(p. 310) shows possible decoders. Fano and stack decoders are not very suit- 
able for an FPGA implementation because of the complexity of organizing a 
stack [162]. A conventional µP/µC realization is much more suitable here. In 
the following sections, maximum-likelihood sequence decoders and algebraic 
algorithms are compared regarding hardware complexity, measured in CLBs 
usage for the Xilinx XC3K FPGA, and achievable error correction. 

Viterbi maximum likelihood sequence decoder. The Viterbi decoder 
deals with an erroneous sequence by determining the corresponding sender 
sequence with the minimum Hamming distance. Put differently, the algorithm 
finds the optimal path through the trellis diagram, and is therefore an optimal 
maximum-likelihood sequence estimator (MLSE) for a memoryless noisy channel. 

The advantage of the Viterbi decoder is its constant decoding time and 
MLSE optimality. The disadvantage lies in its high memory requirements and 
resulting limitation to codes with very short constraint length. Figures 7.13 
and 7.14 show an $R = k/n = 1/2$ encoder and the attendant trellis diagram. 
The constraint length $\nu = m \times k$ is 2, so the trellis has $2^\nu$ nodes. Each node 
has $2^k = 2$ outgoing and at most $2^k = 2$ incoming edges. For a binary trellis 
(k = 1) like this, it is convenient to show a zero as an upward edge and a one 
as a downward edge. 

Fig. 7.12. Survey of tree codes [126]. 

For MLSE decoding it is sufficient to store only the $2^\nu$ paths (and their 
metrics) passing through the nodes at a given level, because the MLSE path 
must pass through one of these nodes. Incoming paths with a smaller metric 
than the “survivor” with the highest metric need not be stored, because these 
paths will never be part of the MLSE path. Nevertheless, the maximum 
metric at any given time may not be part of the MLSE path if it is part 
of a short erroneous sequence. Voting down such a local error is analogous 
to demodulating a digital FM signal with memory [176]. Simulation results 
in [126, p. 381] and [165, pp. 120-3] show that it is sufficient to construct a 
path memory of four to five times the constraint length. Infinite path memory 
yields no significant improvement. 
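
The survivor selection can be sketched in Python for an R = 1/2 code with four trellis nodes. The sketch assumes the generators 7 and 5 (octal), a hard-decision Hamming metric (minimizing the distance, which is equivalent to maximizing the agreement), and an encoder that is flushed with two zero bits so that the survivor ending in the zero state can be returned. Unlike the hardware discussed here, it keeps complete paths instead of a truncated path memory. All names are illustrative.

```python
def conv_encode(bits):
    """Rate-1/2 encoder with generators 7 and 5 (octal); state = last two inputs."""
    s1 = s2 = 0
    out = []
    for u in bits:
        out += [u ^ s1 ^ s2, u ^ s2]
        s1, s2 = u, s1
    return out

def viterbi_decode(received):
    """Hard-decision Viterbi decoder; expects a sequence flushed with two zeros."""
    INF = float("inf")
    metrics = {0: 0, 1: INF, 2: INF, 3: INF}      # node number = 2*s1 + s2
    paths = {s: [] for s in range(4)}
    for t in range(0, len(received), 2):
        r1, r2 = received[t], received[t + 1]
        new_metrics = {s: INF for s in range(4)}
        new_paths = {s: [] for s in range(4)}
        for state in range(4):
            if metrics[state] == INF:
                continue
            s1, s2 = state >> 1, state & 1
            for u in (0, 1):
                c1, c2 = u ^ s1 ^ s2, u ^ s2
                m = metrics[state] + (c1 != r1) + (c2 != r2)
                nxt = (u << 1) | s1
                if m < new_metrics[nxt]:           # keep only the survivor per node
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [u]
        metrics, paths = new_metrics, new_paths
    return paths[0]                                # path ending in the zero state
```

With a free distance of 5, this code corrects any two channel errors in a terminated sequence.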

The Viterbi decoder hardware consists of three main parts: path memory 
with output decoder (see Fig. 7.15), survivor computation, and maximum 
detection (see Fig. 7.16). The path memory is $4\nu 2^\nu$ bits, consuming $2\nu 2^\nu$ 
CLBs. The output decoder uses $(1 + 2 + \ldots + 2^{\nu-1})$ 2-to-1 multiplexers. 
Metric update adders, registers and comparisons are each $(\lceil \log_2(\nu \times n) \rceil + 1)$ 
bits wide. For the maximum computation, additional comparisons, 2-to-1 
multiplexers and a decoder are necessary. 

Fig. 7.13. Encoder for an R = 1/2 convolutional decoder. 

The hardware for decoders with k > 1 seems too complex to implement 
with today’s FPGAs. For n > 2 the information rate R = 1/n is too low, 
so the most suitable code rate is R = 1/2. Table 7.8 lists the complexity in 
CLBs in a XC3K FPGA for constraint lengths $\nu = 2, 3, 4$, and the general 
case, for R = 1/2. It can be seen that complexity increases exponentially 
with constraint length $\nu$, which should thus be as short as possible. Although 
very few errors can be corrected in the short window allowed by such a small 
constraint length, the MLSE algorithm guarantees acceptable performance. 

Next it is necessary to choose an appropriate generating polynomial. It 
is shown in the literature ([177, pp. 306-8], [178, p. 465], [167, pp. 402-7], 
[126, p. 367]) that, for a given constraint length, nonsystematic codes have 
better performance than systematic codes, but using a nonsystematic code 
contradicts the demand for using the information bits without error 
correction. Quick-look-in (QLI) codes are nonsystematic convolutional codes with 
R = 1/2, providing free distance values as good as any known code for 
constraint lengths $\nu = 2$ to 4 [179]. The advantage of QLI codes is that only one 
XOR gate is necessary for the reconstruction of the information sequence. 
QLIs with $\nu = 2, 3$, and 4 have a free distance of $d_f = 5, 6$, and 7, respectively 




Fig. 7.14. Trellis for R = 1/2 convolutional decoder. 

Fig. 7.15. Viterbi decoder with constraint length $4\nu$ and $2^{\nu=2}$ nodes: path memory 
and output decoder. 



[178, p. 465]. This seems to be a good compromise for low power consump- 



Table 7.8. Hardware complexity in CLBs for an R = 1/2 Viterbi decoder for 
$\nu = 2, 3, 4$, and the general case. 

Function           $\nu = 2$   $\nu = 3$   $\nu = 4$   $\nu \in N$ 
Path memory        16          48          128         $4 \times \nu \times 2^{\nu-1}$ 
Output decoder     1.5         3.5         6.5         $1 + 2 + \ldots + 2^{\nu-2}$ 
Metric ΔM          4           4           4           4 
Metric clear       1           2           4           $\lceil (2 + 4 + \ldots + 2^{\nu-1})/4 \rceil$ 
Metric adder       24          64          128         $(\lceil \log_2(n\nu) \rceil + 1) \times 2^{\nu+1}$ 
Survivor-MUX       6           24          48          $(\lceil \log_2(n\nu) \rceil + 1) \times 2^{\nu-1}$ 
Metric compare     6           24          48          $(\lceil \log_2(n\nu) \rceil + 1) \times 2^{\nu-1}$ 
Maximum compare    4.5         14          30          $(\lceil \log_2(n\nu) \rceil + 1) \times \frac{1}{2} \times (1 + 2 + \ldots + 2^{\nu-1})$ 
MUX                3           12          28          $(2 + \ldots + 2^{\nu-1}) \times \frac{1}{2} \times (\lceil \log_2(n\nu) \rceil + 1)$ 
Decoder            1           2           4           $\lceil (2 + \ldots + 2^{\nu-1})/4 \rceil$ 
State machine      4           4           4 



Fig. 7.16. Viterbi decoder with constraint length $4\nu$ and $2^{\nu=2}$ nodes: metric 
calculation (metric update and maximum detector). 



tion. The upper part of Table 7.9 shows the generating polynomials in octal 
notation. 



Error-correction performance of the QLI decoder. To compute the 
error-correction performance of the QLI decoder, it is convenient to use the 
“union bound” method. Because QLI codes are linear, error sequences can be 
computed as a difference from the zero sequence. An MLSE decoder will make 
an incorrect decision if a sequence that starts at the null state, and differs 
from the null word at j separate time steps, contains at least j/2 ones. With 
$q = 1 - p$, the probability of this occurrence is 

$P_j = \begin{cases} \sum_{i=(j+1)/2}^{j} \binom{j}{i} p^i q^{j-i} & \text{for odd } j \\ \frac{1}{2}\binom{j}{j/2} p^{j/2} q^{j/2} + \sum_{i=j/2+1}^{j} \binom{j}{i} p^i q^{j-i} & \text{for even } j. \end{cases}$ (7.44) 




Table 7.9. Union-bound weights for a Viterbi decoder with $\nu = 2$ to 4 using QLI 
codes. 

Code                 $O_1 = 7$    $O_1 = 74$   $O_1 = 66$ 
                     $O_2 = 5$    $O_2 = 54$   $O_2 = 46$ 
Constraint length    $\nu = 2$    $\nu = 3$    $\nu = 4$ 
Distance             Weight $w_j$ 
0-4                  0            0            0 
5                    1            0            0 
6                    4            2            0 
7                    12           7            4 
8                    32           18           12 
9                    80           49           26 
10                   192          130          74 
11                   448          333          205 
12                   1024         836          530 
13                   2304         2069         1369 
14                   5120         5060         3476 
15                   11264        12255        8470 
16                   24576        29444        19772 
17                   53079        64183        43062 
18                   109396       126260       83346 
19                   103665       223980       147474 
20                   262144       351956       244458 

Now the only thing necessary for a bit-error probability formula is to 
compute the number $w_j$ of paths with weight j for the code, which is an 
easily programmable task [162, C.4]. Because $P_j$ decreases exponentially with 
increasing j, only the first few $w_j$ must be computed. Table 7.9 shows the $w_j$ 
for j = 0 to 20. The total error probability can now be computed with: 

$P_b < \frac{1}{k} \sum_{j=0}^{\infty} w_j P_j$. (7.45) 
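
The union bound can be evaluated numerically once the weights are known. The Python sketch below uses the first weights of the $\nu = 2$ column of Table 7.9 and the pairwise error probability $P_j$ for a binary symmetric channel with bit-error probability p. The truncation of the sum at j = 12, the factor 1/k, and the function names are assumptions of this sketch.

```python
from math import comb

# Weights w_j for the nu = 2 code (generators 7 and 5), j = 5..12, from Table 7.9.
WEIGHTS_NU2 = {5: 1, 6: 4, 7: 12, 8: 32, 9: 80, 10: 192, 11: 448, 12: 1024}

def pairwise_error(j, p):
    """Probability P_j that a path differing in j positions is preferred."""
    q = 1.0 - p
    if j % 2:
        return sum(comb(j, i) * p**i * q**(j - i)
                   for i in range((j + 1) // 2, j + 1))
    half = j // 2
    tie = 0.5 * comb(j, half) * p**half * q**half    # ties are split evenly
    return tie + sum(comb(j, i) * p**i * q**(j - i)
                     for i in range(half + 1, j + 1))

def union_bound(weights, p, k=1):
    """Truncated union bound (1/k) * sum_j w_j * P_j on the bit-error probability."""
    return sum(w * pairwise_error(j, p) for j, w in weights.items()) / k
```

For small p the first nonzero term dominates, which is why a truncated weight table is sufficient in practice.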

Syndrome algebraic decoder. The syndrome decoder (Fig. 7.17) and 
encoder (Fig. 7.18), like standard block decoders, compute a number of parity 
bits from the data sequence. The decoder’s newly computed parity bits are 
XOR’d with the received parity bits to create the “syndrome” word, which 
will be nonzero if an error occurs in transmission. The error position and 
value are determined from the syndrome value. In contrast to block codes, 
where only one generator polynomial is used, convolutional codes at data rate 
R = k/n have k + 1 generating polynomials. The complete generator may be 
written in a compact n × k generator matrix. For the encoder of Fig. 7.18 the 
matrix is 

$G(x) = [1 \quad x^{21} + x^{20} + x^{19} + x^{17} + x^{16} + x^{13} + x^{11} + 1]$. (7.46) 

Fig. 7.17. Trial and error majority decoder with J = 8. 

For a systematic code the matrix has the form G(x) = [I:P(x)]. The 
parity check matrix $H(x) = [-P(x)^T : I]$ is easily computed, given that 
$G \times H^T = 0$. The desired syndrome vector is thus $S = v \times H^T$, where v is the 
received bit sequence. 

The syndrome decoder now looks up the calculated syndrome in a table 
to find the correct sequence. To keep the table small, only sequences with 
an error at the first bit position are included. If the decoder needs to correct 
errors of more than one bit, we cannot clear the syndrome after the correction. 
Instead, the syndrome value must be subtracted from a syndrome register (see 
the “Majority” signal in Fig. 7.17). 

A 22-bit table would be necessary for the standard convolutional decoder, 
but it is unfortunately difficult to implement a good FPGA look-up table 
with more than 4 to 11 bit addresses [163]. Majority codes, a special class of 
syndrome-decodable codes, offer an advantage here. This type of canonical 
self-orthogonal code (CSOC) has exclusively ones in the first row of the $\{A_k\}$ 
parity check matrix (where the J columns are used as an orthogonal set to 
compute the syndrome) [167, p. 284]. Thus, every error in the first-bit position 





Fig. 7.18. Systematic (44,22) encoder with rate R = 1/2 and constraint length 
$\nu = 22$. 

Table 7.10. Some majority-decodable “trial and error” codes [167, p. 406]. 

J    $t_{MD}$   $\nu$   Generating polynomial / Orthogonal equation 
2    1          2       $1 + x$ 
                        $s_0, s_1$ 
4    2          6       $1 + x^3 + x^4 + x^5$ 
                        $s_0, s_3, s_4, s_1 + s_5$ 
6    3          12      $1 + x^6 + x^7 + x^9 + x^{10} + x^{11}$ 
                        $s_0, s_6, s_7, s_9, s_1 + s_3 + s_{10}, s_4 + s_8 + s_{11}$ 
8    4          22      $1 + x^{11} + x^{13} + x^{16} + x^{17} + x^{19} + x^{20} + x^{21}$ 
                        $s_0, s_{11}, s_{13}, s_{16}, s_{17}, s_2 + s_3 + s_6 + s_{19}, s_4 + s_{14} + s_{20}, s_1 + s_5 + s_8 + s_{15} + s_{21}$ 
10   5          36      $1 + x^{18} + x^{19} + x^{27} + x^{28} + x^{29} + x^{30} + x^{32} + x^{33} + x^{35}$ 
                        $s_0, s_{18}, s_{19}, s_{27}, s_1 + s_9 + s_{28}, s_{10} + s_{20} + s_{29}, s_{11} + s_{30} + s_{31}, s_{13} + s_{21} + s_{23} + s_{32}, s_{14} + s_{33} + s_{34}, s_2 + s_3 + s_{16} + s_{24} + s_{26} + s_{35}$ 




will cause at least $\lceil J/2 \rceil$ ones in the syndrome register. The decoding rule is 
therefore 

$e_0^i = \begin{cases} 1 & \text{for } \sum_{k=1}^{J} A_k > \lceil J/2 \rceil \\ 0 & \text{otherwise.} \end{cases}$ (7.47) 




Thus the name “majority code”: instead of the expensive syndrome table 
only a majority vote is needed. Massey [167, p. 289] has designed a class 
of majority codes, called trial and error codes, which, instead of evaluating 
the syndrome vector directly, manipulate a combination of syndrome bits to 
get a vector orthogonal to $e_0^i$. This small additional hardware cost results 
in slightly better error correction performance than the conventional CSOC 
codes. Table 7.10 lists some trial and error codes with data rate R = 1/2. 
Figure 7.17 shows a trial and error decoder with J = 8. Table 7.11 shows the 
complexity in CLBs of decoders with J = 4 to 10. 
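
The majority rule can be tried out in software with the smallest nontrivial code of Table 7.10, the J = 4 code with generating polynomial $1 + x^3 + x^4 + x^5$ and the orthogonal check sums $s_0$, $s_3$, $s_4$, $s_1 + s_5$. The Python sketch below is a simplified model: it computes the whole syndrome sequence first and then decodes bit by bit, feeding each correction back into the syndrome, whereas the hardware works on a sliding register. All names and the zero-padding of the tail are assumptions of this sketch.

```python
TAPS = (0, 3, 4, 5)    # exponents of g(x) = 1 + x^3 + x^4 + x^5 (J = 4)
MAX_DELAY = 5

def parity_stream(info):
    """Parity bit at time t is the XOR of info[t - j] over all taps j."""
    out = []
    for t in range(len(info)):
        bit = 0
        for j in TAPS:
            if t - j >= 0:
                bit ^= info[t - j]
        out.append(bit)
    return out

def encode(info):
    """Systematic rate-1/2 encoder; the information is flushed with zeros."""
    padded = list(info) + [0] * MAX_DELAY
    return padded, parity_stream(padded)

def majority_decode(rx_info, rx_parity):
    """Threshold decoding with feedback; corrects up to two errors per window."""
    u = list(rx_info)
    s = [a ^ b for a, b in zip(parity_stream(u), rx_parity)]
    s += [0] * MAX_DELAY                      # guard so that s[t + 5] always exists
    for t in range(len(u)):
        checks = (s[t], s[t + 3], s[t + 4], s[t + 1] ^ s[t + 5])
        if sum(checks) > 2:                   # majority of the J = 4 check sums
            u[t] ^= 1
            for j in TAPS:                    # remove the error from the syndrome
                s[t + j] ^= 1
    return u
```

Each of the four check sums contains the error of the bit being decoded, and no other error bit appears in more than one sum, so up to two errors in the window cannot outvote the majority.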

Table 7.11. Complexity in CLBs of a majority decoder with J = 4 to 10. 

Function            J = 4    J = 6    J = 8    J = 10 
Register            6        12       22       36 
XOR-Gate            2        4        7        11 
Majority-circuit    1        5        7        15 
Sum                 9        22       36       62 

Fig. 7.19. Performance comparison of Viterbi and majority decoders. 

Error-correction capability of the trial and error decoder. To 
calculate the error-correction performance of trial and error codes, we must first 
note that in a window twice the constraint length, the codes allow up to 
$\lfloor J/2 \rfloor$-bit errors [126, p. 440]: 

$P(J) = \sum_{k=0}^{\lfloor J/2 \rfloor} \binom{2\nu}{k} p^k (1 - p)^{2\nu - k}$. (7.48) 

A computer simulation of $10^6$ bits, in Fig. 7.19, reveals good agreement 
with this equation. The equivalent single-error probability $P_B$ of an (n,k) 
code can be computed with 

$P(J) = P(0) = (1 - P_B)^k$ (7.49) 

$\rightarrow P_B = 1 - e^{\ln(P(J))/k}$. (7.50) 
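
Equations (7.48) to (7.50) translate directly into Python. The sketch below evaluates the probability that a window of twice the constraint length contains at most ⌊J/2⌋ errors and converts it into an equivalent bit-error probability. The function names, and the choice of J = 8, $\nu = 22$, k = 22 in the usage note, are illustrative assumptions.

```python
from math import comb, exp, log

def window_success_probability(J, nu, p):
    """P(J): probability of at most floor(J/2) errors in a window of 2*nu bits."""
    return sum(comb(2 * nu, k) * p**k * (1 - p)**(2 * nu - k)
               for k in range(J // 2 + 1))

def equivalent_bit_error(P_J, k):
    """P_B from P(J) = (1 - P_B)^k, i.e., P_B = 1 - exp(ln(P(J)) / k)."""
    return 1.0 - exp(log(P_J) / k)
```

As a usage example, `equivalent_bit_error(window_success_probability(8, 22, 0.01), 22)` gives the equivalent bit-error probability of the J = 8 code at a channel error probability of 1%.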

Final comparison. Figure 7.19 shows the error-correction performance of 
Viterbi and majority decoders. For a comparable hardware cost (Viterbi, 
$\nu = 2$, $d_f = 5$, 67 CLBs and trial and error, t = 5, 62 CLBs) the better 
performance of the majority decoder, due to the greater constraint length 
permitted, is immediately apparent. The optimal MLSE property of the Viterbi 
algorithm cannot compensate for its short constraint length. 

7.2.4 Cryptography Algorithms for FPGAs 

Many communication systems use data-stream ciphers to protect relevant
information, as shown in Fig. 7.20. The key sequence K is more or less a
“pseudorandom sequence” (known to the sender and the receiver), and with
the modulo 2 property of the XOR function, the plaintext P can be
reconstructed at the receiver side, because

$$P \oplus K \oplus K = P \oplus 0 = P. \qquad (7.51)$$

Fig. 7.20. The principle of a synchronous data-stream cipher.
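The reconstruction property (7.51) can be illustrated with a few lines of Python. This is only a sketch: the function `keystream` is a hypothetical stand-in for the key generator K (Python's `random` module is not cryptographically secure), and the bit patterns are arbitrary.

```python
import random

def keystream(seed, n):
    """Hypothetical stand-in for the key sequence K (not secure)."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]

def xor_bits(data, key):
    """Bitwise modulo-2 addition of two bit lists."""
    return [d ^ k for d, k in zip(data, key)]

plaintext = [1, 0, 1, 1, 0, 0, 1, 0]
key = keystream(seed=1234, n=len(plaintext))
ciphertext = xor_bits(plaintext, key)   # sender:   C = P xor K
recovered = xor_bits(ciphertext, key)   # receiver: C xor K = P
```

Sender and receiver run the same generator from the same start value, which is why the cipher is called synchronous.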

In the following, we compare an algorithm based on a linear-feedback 
shift register (LFSR) and a “data encryption standard” (DES) cryptographic 
algorithm. Neither algorithm requires large tables and both are suitable for 
an FPGA implementation. 

Linear Feedback Shift Registers Algorithm 

LFSRs with maximal sequence length are a good approach for an ideal
security key, because they have good statistical properties (see, for
instance, [180, 181]). In other words, it is difficult to analyze the
sequence in a cryptographic attack, an analysis called cryptanalysis.
Because bitwise designs are possible with FPGAs, such LFSRs are more
efficiently realized with FPGAs than PDSPs. Two possible realizations of an
LFSR of length 8 are shown in Fig. 7.21.

For the XOR LFSR the all-zero word is a lock-up state that must never be
reached. If the cycle starts with any nonzero word, the cycle length is
always $2^l - 1$. Because an FPGA sometimes wakes up in the all-zero state,
it can be more convenient to use a “mirrored” or inverted LFSR circuit, in
which the all-zero word is a valid pattern and which produces exactly the
inverse sequence. For this circuit the XOR must be replaced with a “not XOR”
or XNOR gate. Such LFSRs can easily be designed using a PROCESS statement in
VHDL, as the following example shows.

Example 7.12: Length 6 LFSR 

The following VHDL code² implements an LFSR of length 6.

² The equivalent Verilog code lfsr.v for this example can be found in
Appendix A on page 478.

Fig. 7.21. Possible realizations of LFSRs. (a) Fibonacci configuration.
(b) Galois configuration.



LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY lfsr IS                          ------> Interface
  PORT ( clk : IN  STD_LOGIC;
         y   : OUT STD_LOGIC_VECTOR(6 DOWNTO 1));
END lfsr;

ARCHITECTURE flex OF lfsr IS

  SIGNAL ff : STD_LOGIC_VECTOR(6 DOWNTO 1);

BEGIN

  PROCESS          -- Implement length 6 LFSR with xnor
  BEGIN
    WAIT UNTIL clk = '1';
    ff(1) <= NOT (ff(5) XOR ff(6));
    FOR I IN 6 DOWNTO 2 LOOP   -- Tapped delay line:
      ff(I) <= ff(I-1);        -- shift one
    END LOOP;
  END PROCESS;

  PROCESS (ff)
  BEGIN                        -- Connect to I/O cell
    FOR k IN 1 TO 6 LOOP
      y(k) <= ff(k);
    END LOOP;
  END PROCESS;

END flex;

From the simulation of the design in Fig. 7.22, it can be concluded that the
LFSR goes through all possible bit patterns, which results in the maximum
sequence length $2^6 - 1 = 63 \approx 630\,\text{ns} / 10\,\text{ns}$. The design
uses 6 LCs and runs with a Registered Performance of 45.45 MHz.

Fig. 7.22. LFSR simulation.

Note that a complete cycle of an LFSR sequence fulfills the three criteria
for optimal length $2^l - 1$ pseudorandom sequences defined by Golomb [182,
p. 188]:

1) The number of 1s and 0s in a cycle differs by no more than one.

2) Runs of length k (e.g., 111··· sequence, 000··· sequence) have a total
fractional part of all runs of $1/2^k$.

3) The autocorrelation function $C(\tau)$ is constant for $\tau \in [1, n - 1]$.
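The length-6 XNOR LFSR and two of Golomb's criteria can be checked with a short software model. The following Python lines are a sketch only: they mirror the register equations of the VHDL listing, assume an all-zero power-up state, and check criteria 1 and 3 (the run-length criterion is not evaluated here).

```python
def lfsr6_xnor_step(ff):
    """One clock of the length-6 LFSR; ff maps register index 1..6 to a bit."""
    nxt = {1: 1 - (ff[5] ^ ff[6])}      # ff(1) <= NOT (ff(5) XOR ff(6))
    for i in range(2, 7):               # tapped delay line: shift one
        nxt[i] = ff[i - 1]
    return nxt

def one_period(start):
    """Collect the ff(6) bit until the start state recurs."""
    state, bits = dict(start), []
    while True:
        bits.append(state[6])
        state = lfsr6_xnor_step(state)
        if state == start:
            return bits

seq = one_period({i: 0 for i in range(1, 7)})   # assumed power-up state
period = len(seq)

# Criterion 1: the numbers of 1s and 0s differ by no more than one.
ones, zeros = seq.count(1), seq.count(0)

# Criterion 3: periodic autocorrelation of the +/-1 mapped sequence.
pm = [1 - 2 * b for b in seq]
autocorr = [sum(pm[i] * pm[(i + tau) % period] for i in range(period))
            for tau in range(1, period)]
```

The model visits 63 states before it repeats, and the autocorrelation takes the same value for every nonzero shift.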

LFSRs are usually constructed from primitive polynomials in GF(2) using
the circuits shown in Fig. 7.21. Stahnke [168] has compiled a list of
such primitive polynomials up to order 168. This paper is available online
at http://www.jstor.org. With today’s available algebraic software
packages like Maple, Mupad, or Magma such a list can easily be extended. The
following is a code example for Maple to compute the primitive polynomials
of type $x^l + x^a + 1$ with the smallest a.

with(numtheory):
for l from 2 by 1 to 45 do
  for a from 1 by 1 to l-1 do
    if (Primitive(x^l+x^a+1) mod 2) then
      print(l, a);
      break;
    fi;
  od;
od;
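For readers without access to Maple, the same search can be sketched in Python for small l. The approach below is an assumption of this sketch, not the method used by Maple: a polynomial of degree l is taken to be primitive exactly when x has multiplicative order $2^l - 1$ modulo the polynomial, which is found by brute force and is therefore only practical for short registers.

```python
def order_of_x(poly, degree):
    """Multiplicative order of x modulo poly over GF(2); poly is a bit mask."""
    state, steps = 1, 0
    while True:
        state <<= 1                 # multiply by x
        if state >> degree:         # reduce modulo poly
            state ^= poly
        steps += 1
        if state == 1:
            return steps

def is_primitive(poly, degree):
    return order_of_x(poly, degree) == (1 << degree) - 1

def smallest_primitive_trinomial(l):
    """Smallest a with x^l + x^a + 1 primitive, or None if there is none."""
    for a in range(1, l):
        if is_primitive((1 << l) | (1 << a) | 1, l):
            return a
    return None

results = {l: smallest_primitive_trinomial(l) for l in range(2, 13)}
```

For l = 8 the search returns no trinomial, which is one of the lengths for which four-tap polynomials are needed.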

Table 7.12 shows the necessary XOR list of the first 45 maximum length LFSRs
according to Fig. 7.21a. For instance, the entry for polynomial fourteen (14,
13, 11, 9) means the primitive polynomial is

$$p_{14}(x) = x^{14} + x^{14-13} + x^{14-11} + x^{14-9} + 1 = x^{14} + x^5 + x^3 + x + 1.$$

For l > 2 these primitive polynomials always have “twins,” which are
also primitive polynomials [183]. These are the “time” reversed versions
$x^l + x^{l-a} + 1$.

Table 7.12. A list of the first 45 LFSR.

 l  Exponents         l  Exponents          l  Exponents
 1  1                16  16, 14, 13, 11    31  31, 28
 2  2, 1             17  17, 14            32  32, 30, 29, 23
 3  3, 2             18  18, 11            33  33, 20
 4  4, 3             19  19, 18, 17, 14    34  34, 31, 30, 26
 5  5, 3             20  20, 17            35  35, 33
 6  6, 5             21  21, 19            36  36, 25
 7  7, 6             22  22, 21            37  37, 36, 33, 31
 8  8, 6, 5, 4       23  23, 18            38  37, 36, 33, 31
 9  9, 5             24  24, 23, 21, 20    39  39, 35
10  10, 7            25  25, 22            40  40, 37, 36, 35
11  11, 9            26  26, 25, 24, 30    41  41, 38
12  12, 11, 8, 6     27  27, 26, 25, 22    42  42, 39, 38, 35
13  13, 12, 10, 9    28  28, 25            43  43, 41, 40, 36
14  14, 13, 11, 9    29  29, 27            44  44, 42, 41, 37
15  15, 14           30  30, 29, 26, 24    45  45, 44, 43, 41



Stahnke [18, XAPP52] has computed primitive polynomials of type
$x^l + x^a + 1$. There are no primitive polynomials with four elements, i.e.,
($x^l + x^b + x^a + 1$) for l < 45. But it is possible to find polynomials of
the type $x^l + x^{a+b} + x^b + x^a + 1$, which Stahnke used for those l where
a polynomial of the type $x^l + x^a + 1$ does not exist (l = 8, 12, 13, etc.).

The LFSRs with four elements in Table 7.12 were computed to have the
maximum sum (i.e., a + b) for the tap exponents. We will see later that, for
multistep LFSR implementations, this usually gives the minimum complexity.

If n random bits are needed at once, it is possible to clock our LFSR n
times. In general, it is not a good idea to use just the lowest n bits of our
LFSR, since this will lead to weak random properties, i.e., low cryptographic
security. But it is possible to compute the equation for n-bit shifts, so that
only one clock cycle is needed to generate n new random bits. The necessary
equation can be computed more easily if a “state-space” description of the
LFSR is used, as the following example shows.

Example 7.13: Three Steps-at-Once LFSR 

Let us assume a primitive polynomial of length 6, e.g., $p = x^6 + x + 1$, is
used to compute random sequences. The task now is to compute three “new” bits
in one clock cycle. To obtain the required equation, the state-space
description of our LFSR must first be computed, i.e., $x(t+1) = A x(t)$:

$$\begin{bmatrix} x_6(t+1) \\ x_5(t+1) \\ x_4(t+1) \\ x_3(t+1) \\ x_2(t+1) \\ x_1(t+1) \end{bmatrix} =
\begin{bmatrix} 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 1&1&0&0&0&0 \end{bmatrix}
\begin{bmatrix} x_6(t) \\ x_5(t) \\ x_4(t) \\ x_3(t) \\ x_2(t) \\ x_1(t) \end{bmatrix}. \qquad (7.52)$$

With this state-space description, the actual values $x(t)$ and the
transition matrix A are used to compute the new values $x(t+1)$. To compute
the values for $x(t+2)$, simply compute $x(t+2) = A x(t+1) = A^2 x(t)$. The
next iteration gives $x(t+3) = A^3 x(t)$. The equations for an n-step-at-once
LFSR can therefore be computed by evaluating $A^n \bmod 2$. For n = 3 it
follows that

$$A^3 \bmod 2 = \begin{bmatrix} 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 1&1&0&0&0&0 \\ 0&1&1&0&0&0 \\ 0&0&1&1&0&0 \end{bmatrix}. \qquad (7.53)$$

As expected, for the registers $x_6$ to $x_4$ there is a shift of three
positions, while the other three values $x_1$ to $x_3$ are computed using an
EXOR operation. The following VHDL code³ implements this three-step LFSR.



LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY lfsr6s3 IS                       ------> Interface
  PORT ( clk : IN  STD_LOGIC;
         y   : OUT STD_LOGIC_VECTOR(6 DOWNTO 1));
END lfsr6s3;

ARCHITECTURE flex OF lfsr6s3 IS

  SIGNAL ff : STD_LOGIC_VECTOR(6 DOWNTO 1);

BEGIN

  PROCESS   -- Implement three-step length-6 LFSR with xnor
  BEGIN
    WAIT UNTIL clk = '1';
    ff(6) <= ff(3);
    ff(5) <= ff(2);
    ff(4) <= ff(1);
    ff(3) <= NOT (ff(5) XOR ff(6));
    ff(2) <= NOT (ff(4) XOR ff(5));
    ff(1) <= NOT (ff(3) XOR ff(4));
  END PROCESS;

  PROCESS (ff)
  BEGIN                        -- Connect to I/O cell
    FOR k IN 1 TO 6 LOOP
      y(k) <= ff(k);
    END LOOP;
  END PROCESS;

END flex;

Figure 7.23 shows a simulation for the three-step LFSR design. Comparing
the simulation of this LFSR in Fig. 7.23 with the simulation of the
single-step LFSR in Fig. 7.22, it can be concluded that now every third
sequence value occurs. The cycle length is reduced from $2^6 - 1$ to
$(2^6 - 1)/3 = 21$. The design uses 6 LCs and runs with a Registered
Performance of 43.85 MHz.

³ The equivalent Verilog code lfsr6s3.v for this example can be found in
Appendix A on page 478.

Fig. 7.23. Multistep LFSR simulation.
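The relation between the single-step and the three-step design can be verified with a small software model. The Python sketch below copies the register equations of the two VHDL listings; the all-zero start state is an assumption.

```python
def single_step(ff):
    """One clock of the length-6 XNOR LFSR (Example 7.12)."""
    nxt = {1: 1 - (ff[5] ^ ff[6])}
    for i in range(2, 7):
        nxt[i] = ff[i - 1]
    return nxt

def three_step(ff):
    """One clock of lfsr6s3: three LFSR steps evaluated at once."""
    return {6: ff[3], 5: ff[2], 4: ff[1],
            3: 1 - (ff[5] ^ ff[6]),
            2: 1 - (ff[4] ^ ff[5]),
            1: 1 - (ff[3] ^ ff[4])}

def cycle_length(step, start):
    """Number of clocks until the start state recurs."""
    state, n = step(start), 1
    while state != start:
        state, n = step(state), n + 1
    return n

start = {i: 0 for i in range(1, 7)}     # assumed power-up state
```

One clock of `three_step` gives the same state as three clocks of `single_step`, so the three-step register returns to its start state after 21 clocks instead of 63.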



To implement such a multistep LFSR, we want to select the primitive
polynomial that results in the lowest circuit effort, which can be computed
by counting the nonzero entries in the $A^k \bmod 2$ matrix, and/or the
maximum fan-in for the register, which corresponds to the number of ones in
each row. For a few shifts, the fan-in for the circuit from Fig. 7.21a may be
advantageous. It can also be observed⁴ that if the feedback signals are close
in the A matrix, some entries in the $A^k$ matrix may become zero, due to
the modulo 2 operations. As mentioned earlier, the two and four-tap LFSR
data in Table 7.12 were therefore computed to yield the maximum sum of all
taps. For the same sum, the primitive polynomial that has the larger value
for the smallest tap was selected, e.g., (11, 12) is better than (10, 13). This
was chosen because tap l is mandatory for the maximum-length LFSR, and
the other values should be close to this tap.

If, for instance, Stahnke’s $s_{14,a}(x) = x^{14} + x^{12} + x^{11} + x + 1$
primitive polynomial is used, this will result in 58 entries for an n = 8
multistep LFSR, while if the LFSR from Table 7.12,
$p_{14}(x) = x^{14} + x^5 + x^3 + x + 1$ (i.e., taps 14, 13, 11, 9) is used, the
$A^8 \bmod 2$ matrix has only 35 entries (Exercise 7.6, p. 362). Fig. 7.24
shows the total number of ones for the LFSR for the two polynomials with the
two different implementations from Fig. 7.21, while Fig. 7.25 shows the
maximum fan-in (i.e., the maximum needed input bit width for a LC) for this
LFSR. It can be concluded from the two figures that a careful choice of the
polynomial and LFSR structure can provide substantial savings. For the
multistep LFSR synthesis, it can be seen from Fig. 7.25 that the LFSR of
Fig. 7.21b has fewer fan-ins (i.e., smaller LC input bit width), but for
longer multistep k, the effort seems similar for the primitive polynomials
from Table 7.12.
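The effort measures named above (number of ones and maximum fan-in of $A^k \bmod 2$) can be computed with a few lines of Python. The sketch below assumes the state ordering of (7.52) and a feedback into register 1 as in Fig. 7.21a; the function names are hypothetical, and only the length-6 example is evaluated.

```python
def transition_matrix(length, taps):
    """Matrix A for state (x_l, ..., x_1); taps are the registers XORed into x_1."""
    A = [[0] * length for _ in range(length)]
    for row in range(length - 1):
        A[row][row + 1] = 1              # x_i(t+1) = x_{i-1}(t)
    for t in taps:
        A[length - 1][length - t] = 1    # feedback row for x_1(t+1)
    return A

def matmul_mod2(X, Y):
    n = len(X)
    return [[sum(X[i][k] & Y[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]

def power_mod2(A, k):
    """A^k mod 2 by repeated multiplication."""
    R = [[int(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(k):
        R = matmul_mod2(R, A)
    return R

def effort(A, k):
    """Return (total number of ones, maximum fan-in) of A^k mod 2."""
    Ak = power_mod2(A, k)
    return sum(map(sum, Ak)), max(map(sum, Ak))

A6 = transition_matrix(6, taps=(6, 5))
ones, fan_in = effort(A6, 3)
```

The same functions can be applied to a 14-bit register with different tap sets to compare candidate polynomials for a given number of steps k.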

⁴ It is obviously not practical to search all LFSRs for the one with the
smallest implementation effort, because there are $\varphi(2^l - 1)/l$
primitive polynomials, where $\varphi(x)$ is the Euler function that computes
the number of coprimes to x. For instance, a 16-bit register has
$\varphi(2^{16} - 1)/16 = 2048$ different primitive polynomials [183]!

Combining LFSR

An additional gain in performance in cryptographic security can be achieved
if several LFSR registers are combined into one key generator. Several linear
and nonlinear combinations exist [170], [169, pp. 150–173]. Meaningful for
implementation effort and security are nonlinear combinations with
thresholds. For a combination of three different LFSRs with length $L_1$,
$L_2$, and $L_3$ the linear complexity, which is the equivalent length of one
LFSR (which may be synthesized with the Berlekamp-Massey algorithm, for
instance [169, pp. 141–9]), provides

$$L_{\text{total}} = L_1 \times L_2 + L_2 \times L_3 + L_1 \times L_3. \qquad (7.54)$$

Figure 7.26 shows a realization for such a scheme.

Since the key in the selected paging format has 50 bits, a total length of
2 × 50 = 100 registers was chosen, and the three feedback polynomials are:

$$p_{33}(x) = x^{33} + \ldots + x + 1 \qquad (7.55)$$

$$p_{29}(x) = x^{29} + x^2 + 1 \qquad (7.56)$$

$$p_{38}(x) = x^{38} + x^6 + x^5 + x + 1. \qquad (7.57)$$

All the polynomials are primitive, which guarantees that the length of all
three shift-register sequences gives a maximum. For the linear complexity of
the combination it follows that:

$$L_1 = 33; \quad L_2 = 29; \quad L_3 = 38$$

$$L_{\text{total}} = 33 \times 29 + 33 \times 38 + 29 \times 38 = 3313.$$
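The linear complexity quoted above is a simple sum of products and can be recomputed directly. In the following Python sketch the threshold element is modeled as a majority vote of the three LFSR output bits; that reading of "threshold" is an assumption of this sketch.

```python
def majority(a, b, c):
    """Threshold (majority) combination of three LFSR output bits."""
    return (a & b) | (b & c) | (a & c)

def linear_complexity(L1, L2, L3):
    """Linear complexity of the combination, following (7.54)."""
    return L1 * L2 + L2 * L3 + L1 * L3

L_total = linear_complexity(33, 29, 38)
total_registers = 33 + 29 + 38
```

With 100 registers in total, the combined generator behaves like a single LFSR of length 3313.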



Table 7.13. Cost, measured in CLBs, of a 3K Xilinx FPGA.

Function group            CLBs
50-bit key register        25
100-bit shift register     50
Feedback                    3
Threshold                   0.5
XOR with message            0.5
Total                      79

Fig. 7.26. Realization of the data-stream cipher with 3 LFSR.

After each coding the key is lost, and an additional 50 registers are needed 
to store the key. The 50-bit key is used twice. Table 7.13 shows the hardware 
resources required with Xilinx FPGAs of the 3K family. 

DES based algorithm. The data encryption standard (DES), outlined in 
Fig. 7.27, is typically used in a block cipher. By selecting the “output feedback 
mode” (OFB) it is also possible to use the modified DES in a data-stream 
cipher (see Fig. 7.28). The other modes (ECB, CBC, or CFB) of the DES are, 
in general, not applicable for communication systems, due to the “avalanche 
effect” : A single-bit error in the transmission will alter approximately 50% of 
all bits in a block. 
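The reason why the OFB mode suits a noisy channel can be seen in a small model. The Python sketch below uses a hypothetical 25-bit mixing function `toy_block` in place of the DES (it is not the DES and offers no security); all constants and data words are arbitrary.

```python
def toy_block(state, key, width=25):
    """Hypothetical stand-in for the block function; NOT the DES."""
    mask = (1 << width) - 1
    x = (state ^ key) & mask
    x = (x * 0x9E3779B1 + 0x7F4A7C15) & mask   # arbitrary mixing constants
    return x ^ (x >> 11)

def ofb_keystream(iv, key, blocks, width=25):
    """OFB: the block output is fed back as the next block input."""
    state, out = iv, []
    for _ in range(blocks):
        state = toy_block(state, key, width)
        out.append(state)
    return out

def ofb_crypt(words, iv, key):
    """Encryption and decryption are the same XOR with the key stream."""
    ks = ofb_keystream(iv, key, len(words))
    return [w ^ k for w, k in zip(words, ks)]

plain = [0x155AA55, 0x0F0F0F0, 0x1234567]
cipher = ofb_crypt(plain, iv=0x0ABCDEF, key=0x1C0FFEE)
corrupted = [cipher[0] ^ 0b100] + cipher[1:]   # one-bit channel error
received = ofb_crypt(corrupted, iv=0x0ABCDEF, key=0x1C0FFEE)
```

Because the key stream does not depend on the transmitted data, the single-bit channel error appears as a single-bit error in the received plaintext and does not spread.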

We will review the principles of the DES algorithm and then discuss 
suitable modifications for FPGA implementations. 

Fig. 7.27. State machine for a block encryption system (DES).

The DES comprises a finite-state machine translating plaintext blocks
into ciphertext blocks. First the block to be substituted is loaded into the
state register (32 bits). Next it is expanded (to 48 bits), combined with the
key (also 48 bits) and substituted in eight 6→4 bit-width S-boxes. Finally,
permutations of single bits are performed. This cycle may be (if desired,
with a changing key) applied several times. In the DES, the key is usually
shifted one or two bits so that after 16 rounds the key is back in the original
position. Because the DES can therefore be seen as an iterative application
of the Feistel cipher (shown in Fig. 7.29), the S-boxes need not be invertible.
To simplify an FPGA realization some modifications are useful, such as a
reduction of the length of the state register to 25 bits. No expansion is used.
Use the final permutations as listed in Table 7.14.
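The bit permutation of Table 7.14 amounts to pure wiring in hardware; in software it can be modeled as follows. This Python sketch is illustrative only. The "to" entries for bits 11 and 12 are taken as 1 and 7, the only values that complete a permutation of 0 to 24.

```python
# "To bit" entries of Table 7.14, indexed by the "From bit" number 0..24.
PERM = [20, 4, 5, 10, 15, 21, 0, 6, 11, 16, 22, 1, 7,
        12, 17, 23, 2, 8, 13, 18, 24, 3, 9, 14, 19]

def permute(word):
    """Move bit i of a 25-bit word to bit PERM[i]."""
    out = 0
    for src, dst in enumerate(PERM):
        out |= ((word >> src) & 1) << dst
    return out

def unpermute(word):
    """Inverse wiring: move bit PERM[i] back to bit i."""
    out = 0
    for src, dst in enumerate(PERM):
        out |= ((word >> dst) & 1) << src
    return out
```

Since every target position occurs exactly once, the permutation can be undone, and no logic cells are needed for it.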

Because most FPGAs only have four to five input look-up tables (LUTs), 
S-boxes with five inputs have been designed, as displayed in Table 7.15. 

Although the intention was to use the OFB mode only, the S-boxes in
Table 7.15 were generated in such a manner that they can be inverted. The
modified DES may therefore also be used as a normal block cipher (electronic
code book).

Fig. 7.28. Block cipher in the OFB-mode used as data-stream cipher.

Fig. 7.29. Principle of the Feistel network.

Table 7.14. Table for permutation.

From bit no.   0  1  2  3  4  5  6  7  8  9 10 11 12
To bit no.    20  4  5 10 15 21  0  6 11 16 22  1  7

From bit no.  13 14 15 16 17 18 19 20 21 22 23 24
To bit no.    12 17 23  2  8 13 18 24  3  9 14 19

Table 7.15. The five newly designed substitution boxes (S-boxes). All
entries are hexadecimal.

Input  Box 1  Box 2  Box 3  Box 4  Box 5
  0     1E      F     14     19      6
  1     13      1     1D     14      E
  2     14     13     16      D     1A
  3      1     1F      B      4      3
  4     1A     19      5     1C      B
  5     1B     1C      E     1A     1E
  6      E     12      8     1E      0
  7      B     11      F      1      2
  8      D      8      4      C     1D
  9     10      7      C      F      C
  A      3     1B     1E     1B     18
  B      0      0     13     1D     17
  C      4     1A     10      5      1
  D      6      C      1     15     15
  E      A     1D     18      E     1B
  F     17      2     17     13      9
 10     19      B     1C     17     19
 11     16     1E      A      9      A
 12      7     18     1B      3      4
 13     1C      D      3     10     14
 14     1D      5     19      A     13
 15      5     14      D     16     11
 16      2     15      0     12     10
 17     1F      9      2     1F     12
 18      F      3     15      B      5
 19     11     10      6      2      F
 1A      C      6      7      6      8
 1B     18     17     12     18     16
 1C      9      4     1F     11     1C
 1D     15     16     1A      8      7
 1E      8      E      9      7      D
 1F     12      A     11      0     1F

A reasonable test for S-boxes is the dependency matrix. This matrix
shows, for every input/output combination, the probability that an output
bit changes if an input bit is changed. With the avalanche effect the ideal
probability is 1/2. Table 7.16 shows the dependency matrix for the new five
S-boxes. Instead of the probability, the table shows the absolute number of
occurrences. Since there are $2^5 = 32$ possible input vectors for each S-box,
the ideal value is 16. A random generator was used to generate the S-boxes.
The reason that some values differ much from the ideal 16 may lie in the
desired inversion.
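The dependency matrix can be computed directly from an S-box table. The following Python lines are a sketch using Box 1 of Table 7.15; the ordering (rows for input bits, columns for output bits, least significant bit first) is an assumption that reproduces the first row of Box 1 in Table 7.16.

```python
# Box 1 of Table 7.15 (hexadecimal), indexed by the 5-bit input value.
SBOX1 = [0x1E, 0x13, 0x14, 0x01, 0x1A, 0x1B, 0x0E, 0x0B,
         0x0D, 0x10, 0x03, 0x00, 0x04, 0x06, 0x0A, 0x17,
         0x19, 0x16, 0x07, 0x1C, 0x1D, 0x05, 0x02, 0x1F,
         0x0F, 0x11, 0x0C, 0x18, 0x09, 0x15, 0x08, 0x12]

def dependency_matrix(sbox, bits=5):
    """counts[i][j]: number of inputs for which flipping input bit i
    changes output bit j (ideal value: half of all inputs)."""
    counts = [[0] * bits for _ in range(bits)]
    for x in range(1 << bits):
        for i in range(bits):
            diff = sbox[x] ^ sbox[x ^ (1 << i)]
            for j in range(bits):
                counts[i][j] += (diff >> j) & 1
    return counts

dep = dependency_matrix(SBOX1)
# The inverse table exists because Box 1 is a one-to-one mapping.
inverse = [SBOX1.index(v) for v in range(32)]
```

For an invertible S-box every output bit is balanced, which is why all counts are multiples of four.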

The hardware effort of the DES-based algorithm is summarized in Table 
7.17. 

Table 7.16. Dependency matrix for the five substitution boxes (ideal value is 16).

       Box 1               Box 2               Box 3
  20 12 20 20 20      20 16 20 12 20      20 12 16 16 16
  12 20 12 16 16      20 20 20 16 16      16 20 16 16 16
  12 16 16 12  8      12 20 20 16  8      16 16 20 12 12
  16 16 20 12 16      16 24 12 16 12      16  8 12 16 20
  20 16 20 12 12      16 20 16 20 20      20 12 12 20 12

       Box 4               Box 5
  20 16 20 20 16      12 20  8 12 20
  12 16 12 16 20      20 12 16 24 20
  20 16 16 20 16      16 12 12 20 16
  20 16 16 20 24      16 20 16 20 12
  16 12 28 20 16      12 16 16 12 24

Table 7.17. Hardware effort of the modified DES based algorithm.

Function group                           CLBs
25-bit key register                      12.5
25-bit additions                         12.5
25-bit state register                    12.5
Five S-boxes 5→5                         25
Permutation                               0
25-bit initialization vector             12.5
Multiplex: Initialization vector/S-box   12.5
XOR with message                          1
Total                                    87.5

Cryptographic performance comparison. We will next discuss the
cryptographic performance analysis of the LFSR- and DES-based algorithms.
Several security tests have been defined and the following comparison shows
the two most interesting (the others do not show clear differences between
the two schemes). For both tests, 100 random keys were generated.

1) Using different keys, the generated sequences were analyzed. In each
random key, one bit was changed and the number of bit changes in the
plaintext was recorded. On average, about 50% of the bits should be
inverted (avalanche effect).

2) Similar to Test 1, but this time the number of changes in the output
sequence were analyzed, depending on the changed position of the key.
Again, 50% of the bits should change in sign.

For both tests, plaintexts of 64-bit length were used (see again Fig. 7.20,
p. 325). The plaintext is arbitrary. For Test 1, all variations over each
individual key position were accumulated. For Test 2, all changes depending
on the position in the output sequence were accumulated. This test was
performed for 100 random keys. For Test 1, the ideal value is
64 × 0.5 × 100 = 3200 and for Test 2 the optimal value is
50 × 0.5 × 100 = 2500. Figures 7.30 and 7.31 display the results. They
clearly show that the DES-OFB scheme is much more sensitive to changes in
the key than is the scheme with three LFSRs. The conclusion from Test 2 is
that the LFSR scheme needs about 32 steps until a change in the key will
affect the output sequence. For the DES-OFB scheme, only the first four
samples differ considerably from the ideal value of 2500.

Fig. 7.30. Results of Test 1.

Fig. 7.31. Results of Test 2.

Due to the superior test results, the DES-OFB scheme may be preferred
over the LFSR scheme.

A final note on encryption security. In general, it is not easy to conclude
that an encryption system is secure. Besides the fact that a key may be stolen,
the fact that a fast crack algorithm is not now known does not prove that
there are no such fast algorithms. There is also the problem of a “brute
force attack” using more powerful computers and/or parallel attacks. A good
example is the 56-bit key DES algorithm, which was the standard for many
years but was finally declared insecure in 1997. The DES was first cracked
by a network of volunteer computer owners on the Internet, which cracked
the key in 39 days. Later, in July 1998, the Electronic Frontier Foundation
(EFF) finished the design of a cracker machine. It has been documented in
a book [184], including all schematics and software source code, which can
be downloaded from http://www.eff.org/. This cracker machine performs
an exhaustive key search and can crack any 56-bit key in less than five days.
It was built out of custom chips, each of which has 24 cracker units. Each
of the 29 boards used consists of 64 “Deep Crack” chips, i.e., a total of 1856
chips, or 44 544 units, are in use. The system cost was $250,000. When DES
was introduced in 1977 the system costs were estimated at $20 million, which
corresponds to about $40 million today. This shows a good approximation to
“Moore’s law,” which says that every 18 months the size or speed or price
of microprocessors improves by a factor 2. From 1977 to 1998 the price of
such a machine should drop to $40 \times 10^6 / 2^{22/1.5} \approx$ fifteen
hundred dollars, i.e., it should be affordable to build a DES cracker today
(as was proven by the EFF).



Fig. 7.32. Triple DES (K_i = keys; E = single encryption; D = single decryption).

Table 7.18. Encryption algorithms [185].

Algorithm    Key size   Mathematical operations/       Symmetry  Developed by (year)
             (bits)     principle
DES          56         XOR, fixed S-boxes             s         IBM (1977)
Triple DES   112–168    XOR, fixed S-boxes             s
AES          128–256    XOR, fixed S-boxes             s         Daemen/Rijmen (1998)
RSA          variable   Prime factors                  a         Rivest/Shamir/Adleman (1977)
IDEA         128        XOR, add., mult.               s         Massey/Lai (1991)
Blowfish     < 448      XOR, add., fixed S-boxes       s         Schneier (1993)
RC5          < 2048     XOR, add., rotation            s         Rivest (1994)
CAST-128     40–128     XOR, rotation, S-boxes         s         Adams/Tavares (1997)

Therefore the 56-bit DES is no longer secure, but it is now common to
use triple DES, as displayed in Fig. 7.32, or other 128-bit key systems. Table
7.18 shows that these systems seem to be secure for the next few years. The
EFF cracker, for instance, today will need about $5 \times 2^{112}$ days, or
$7 \times 10^{31}$ years, to crack the triple DES.
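The two estimates quoted in this note can be recomputed with a few lines of Python. This is only a back-of-the-envelope sketch using the figures from the text (cost halving every 1.5 years, an exponent of 22 years, about five days per 56-bit key search) and assuming a 168-bit triple-DES key.

```python
# Moore's-law estimate for the cost of a DES cracker.
cost_1977_in_todays_dollars = 40e6
years = 22                               # exponent used in the text
cost_now = cost_1977_in_todays_dollars / 2 ** (years / 1.5)

# Exhaustive search of a 168-bit key with a machine that needs about
# 5 days for a 56-bit key: 2**(168 - 56) times longer.
days_needed = 5 * 2 ** (168 - 56)
years_needed = days_needed / 365
```

The first value comes out at roughly fifteen hundred dollars and the second at roughly $7 \times 10^{31}$ years.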

The first column in Table 7.18 lists the commonly used abbreviations for the
algorithms. The second and third columns contain the typical parameters of
the algorithm. Symmetric algorithms (designated in the fourth column with
an “s”) are usually based on Feistel’s algorithm, while asymmetric algorithms
(designated with an “a”) can be used in a public/private key system. The last
column displays the name of the developer and the year the algorithm was
first published.

7.3 Modulation and Demodulation

For a long time the goal of communications system design was to realize a 
fully digital receiver, consisting of only an antenna and a fully programmable 
circuit with digital filters, demodulators and/or decoders for error correction 
and cryptography on a single programmable chip. With today’s FPGA gate 
count above one million gates this has become a reality. “FPGAs will clearly
be a key technology for communication systems well into the 21st century”
as predicted by Carter [186]. In this section, the design and implementation
of a communication system is developed in the context of FPGAs.



7.3.1 Basic Modulation Concepts 

A basic communication system transmits and receives information broadcast
over a carrier frequency, say $f_0$. This carrier is modulated in amplitude,
frequency or phase, proportional to the signal x(t) being transmitted. Figure
7.33 shows a modulated signal for a binary transmission. For binary
transmission, the modulations are called amplitude shift keying (ASK), phase
shift keying (PSK), and frequency shift keying (FSK).

Fig. 7.33. ASK, PSK, and FSK modulation.

In general, it is more efficient to describe a (real) modulated signal with
a projection of a rotating arrow on the horizontal axis, according to

$$s(t) = \Re\left\{ A(t)\, e^{j(2\pi f_0 t + \Delta\phi(t) + \phi_0)} \right\} = A(t) \cos(2\pi f_0 t + \Delta\phi(t) + \phi_0), \qquad (7.58)$$

where $\phi_0$ is a (random) phase offset, $A(t)$ describes the part of the
amplitude envelope, and $\Delta\phi(t)$ describes the frequency- or
phase-modulated component, as shown in Fig. 7.34. As can be seen from (7.58),
AM and PM/FM can be used separately to transmit different signals.

Fig. 7.34. Modulation in the complex plane.

An efficient solution (that does not require large tables) for realizing a
universal modulator is the CORDIC algorithm discussed in Chap. 2 (p. 94).
The CORDIC algorithm is used in the rotation mode, i.e., it is a coordinate
converter from $(R, \theta) \rightarrow (X, Y)$. Figure 7.35 shows the complete
modulator for AM, PM, and FM.

To implement amplitude modulation, the signal A(t) is directly connected 
with the radius R input of the CORDIC. In general, the CORDIC algorithm 
in rotation mode has an attendant linear increase in the radius. This corre- 
sponds to a change in the gain of an amplifier and need not be taken into 
consideration for the AM scheme. When the linear increased radius (factor 
1.6468, see Table 2.1, p. 35), is not desired, it is possible either to scale the 
input or the output by 1/1.6468 with a constant coefficient multiplier. 








Fig. 7.35. Universal modulator using CORDIC. 

The phase of the transmitted signal $\theta = 2\pi f_0 t + \Delta\phi(t)$ must
also be computed. To generate the constant carrier frequency, a linearly
increasing phase signal according to $2\pi f_0 t$ must be generated, which can
be done with an accumulator. If FM should be generated, it is possible to
modify $f_0$ by $\Delta f$, or to use a second accumulator to compute
$2\pi \Delta f t$, and to add the results of the two accumulators. For the PM
signal, a constant offset (not increasing in time) is added to the phase of the
signal. These phase signals are added and applied to the angle input z or
$\theta$ of the CORDIC processor. The Y register is set to zero at the
beginning of the iterations.

The following example demonstrates a fully pipelined version of the 
CORDIC modulator. 

Example 7.14: Universal Modulator using CORDIC 

A universal modulator for AM, PM, and FM according to Fig. 7.35, can be
designed with the following VHDL code⁵ of the CORDIC part.

PACKAGE nine_bit_int IS    -- User defined types
  SUBTYPE NINE_BIT IS INTEGER RANGE -256 TO 255;
  TYPE ARRAY_NINE_BIT IS ARRAY (0 TO 3) OF NINE_BIT;
END nine_bit_int;

LIBRARY work;
USE work.nine_bit_int.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;

ENTITY ammod IS                         ------> Interface
  PORT (clk               : IN  STD_LOGIC;
        r_in, phi_in      : IN  NINE_BIT;
        x_out, y_out, eps : OUT NINE_BIT);
END ammod;

ARCHITECTURE flex OF ammod IS
BEGIN

  PROCESS                               ------> Behavioral Style
    VARIABLE x, y, z : ARRAY_NINE_BIT;  -- Tapped delay line
  BEGIN
    WAIT UNTIL clk = '1';     -- Compute last value first
    x_out <= x(3);            -- in sequential statements !!
    eps   <= z(3);
    y_out <= y(3);

    IF z(2) > 0 THEN          -- Rotate 14 degrees
      x(3) := x(2) - y(2) / 4;
      y(3) := y(2) + x(2) / 4;
      z(3) := z(2) - 14;
    ELSE
      x(3) := x(2) + y(2) / 4;
      y(3) := y(2) - x(2) / 4;
      z(3) := z(2) + 14;
    END IF;

    IF z(1) > 0 THEN          -- Rotate 26 degrees
      x(2) := x(1) - y(1) / 2;
      y(2) := y(1) + x(1) / 2;
      z(2) := z(1) - 26;
    ELSE
      x(2) := x(1) + y(1) / 2;
      y(2) := y(1) - x(1) / 2;
      z(2) := z(1) + 26;
    END IF;

    IF z(0) > 0 THEN          -- Rotate 45 degrees
      x(1) := x(0) - y(0);
      y(1) := y(0) + x(0);
      z(1) := z(0) - 45;
    ELSE
      x(1) := x(0) + y(0);
      y(1) := y(0) - x(0);
      z(1) := z(0) + 45;
    END IF;

    IF phi_in > 90 THEN       -- Test for |phi_in| > 90
      x(0) := 0;              -- Rotate 90 degrees
      y(0) := r_in;           -- Input in register 0
      z(0) := phi_in - 90;
    ELSIF phi_in < -90 THEN
      x(0) := 0;
      y(0) := - r_in;
      z(0) := phi_in + 90;
    ELSE
      x(0) := r_in;
      y(0) := 0;
      z(0) := phi_in;
    END IF;
  END PROCESS;

END flex;

⁵ The equivalent Verilog code ammod.v for this example can be found in
Appendix A on page 479.

Fig. 7.36. Simulation of an AM modulator using the CORDIC algorithm.

Figure 7.36 reports the simulation of an AM signal. Note that the Altera
simulation does not produce signed data, but rather unsigned binary data
(where negative values have a 512 offset). A pipeline delay of four steps is
seen and the value 100 is enlarged by a factor of 1.6. A switch in radius r_in
from 100 to 25 results in the maximum value x_out dropping from 163 to 42.
The CORDIC modulator runs at 28.49 MHz and uses 279 LCs.
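The arithmetic of the CORDIC modulator can be reproduced with a small software model. The Python sketch below is a combinational model only: the pipeline registers are omitted, so the four-clock latency is not modeled. It follows the comparisons and integer divisions of the VHDL listing as printed; the function names are hypothetical.

```python
def div_trunc(a, b):
    """Integer division truncating toward zero, as the VHDL '/' operator does."""
    q = abs(a) // b
    return q if a >= 0 else -q

def cordic_modulator(r_in, phi_in):
    """Return (x_out, y_out, eps) for radius r_in and angle phi_in in degrees."""
    # Stage 0: pre-rotation by +/-90 degrees if |phi_in| > 90
    if phi_in > 90:
        x, y, z = 0, r_in, phi_in - 90
    elif phi_in < -90:
        x, y, z = 0, -r_in, phi_in + 90
    else:
        x, y, z = r_in, 0, phi_in
    # Stages 1 to 3: micro-rotations by 45, 26, and 14 degrees
    for divisor, angle in ((1, 45), (2, 26), (4, 14)):
        if z > 0:
            x, y, z = (x - div_trunc(y, divisor),
                       y + div_trunc(x, divisor), z - angle)
        else:
            x, y, z = (x + div_trunc(y, divisor),
                       y - div_trunc(x, divisor), z + angle)
    return x, y, z

x_out, y_out, eps = cordic_modulator(100, 0)
```

For a radius of 100 and a phase of zero the model gives an x output of 162, i.e., the gain of about 1.6 mentioned above, with a residual angle error of 5 degrees.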



Demodulation may be coherent or incoherent. A coherent receiver must 
recover the unknown carrier phase (j) 0 , while an incoherent one does not need 
to do so. If the receiver uses an intermediate frequency (IF) band, this type 
of receiver is called a superhet or double superhet (two IF bands) receiver. 
IF receivers are also sometimes called heterodyne receivers. If no IF stages 
are employed, a zero IF or homodyne receiver results. Figure 7.37 presents a 
systematic overview of the different types of receivers. Some of the receivers 
can only be used for one modulation scheme, while others can be used for 
multiple modes (e.g., AM, PM, and FM). The latter is called a universal 
receiver. We will first discuss the incoherent receiver, and then the coherent 
receiver. 

All receivers make intensive use of filters (as discussed in Chaps. 3 and 4), 
in order to select only the signal components of interest. In addition, for 
heterodyne receivers, filters are needed to suppress the mirror frequencies, 
which arise from the frequency shift 

$$s(t) \times \cos(2\pi f_m t) \;\longleftrightarrow\; S(f + f_m) + S(f - f_m). \qquad (7.59)$$

7.3.2 Incoherent Demodulation 

In an incoherent demodulation scheme, it is assumed that the exact carrier 
frequency is known to the receiver, but the initial phase $\phi_0$ is not. 

[Figure 7.37: classification tree of demodulators. 
Incoherent: envelope detector (only AM); limiter discriminator (only FM); 
quadrature mixer, quadrature sampler, Hilbert transformer, and Hilbert 
sampler (each in combination with CORDIC). 
Coherent: phase-locked loop (PLL); Costas loop (CL).] 

Fig. 7.37. Coherent and incoherent demodulation schemes. 



If the signal component is successfully selected with digital or analog 
filtering, the question arises whether only one demodulation mode (e.g., AM 
or FM) or a universal demodulator is needed. An incoherent AM demodulator 
can be as simple as a full- or half-wave rectifier and an additional lowpass 
filter. For FM or PM demodulation, only the limiter/discriminator type of 
demodulator is an efficient implementation. This demodulator builds a 
threshold of the input signal to limit the values to ±1, and then basically 
“measures” the distance between the zero crossings. These receivers are 
easily implemented with FPGAs but sometimes produce $2\pi$ jumps in the phase 
signal (called “clicks” [187, 188]). There are other demodulators with better 
performance. 

We will focus on universal receivers using in-phase and quadrature 
components. This type of receiver basically inverts the modulation scheme 
relative to (7.58) from p. 342. In a first step we have to compute, from the 
received cosines, components that are “in-phase” with the sender’s sine 
components (which are in quadrature to the carrier, hence the name Q phase). 
These I and Q phases are used to reconstruct the arrow (rotating with the 
carrier frequency) in the complex plane. Now, the demodulation is just the 
inversion of the circuit from Fig. 7.35. It is possible to use the CORDIC 
algorithm in the vectoring mode, i.e., a coordinate conversion 
$(X, Y) \rightarrow (R, \theta)$ with $I = X$ and $Q = Y$ is used. Then the output $R$ is 
directly proportional to the AM portion, and the PM/FM part can be 
reconstructed from the $\theta$ signal, i.e., the Z register. 
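The vectoring mode can be illustrated with a floating-point model. This is only a sketch under stated assumptions: it uses 16 micro-rotations with exact arctangent values and divides out the CORDIC gain at the end, whereas a hardware version would use fixed-point shifts and a stored angle table. The name `cordic_vectoring` and its parameters are hypothetical.

```python
import math


def cordic_vectoring(i, q, iterations=16):
    """Convert (I, Q) = (X, Y) to (R, theta) by driving the Y register to zero."""
    x, y, z = float(i), float(q), 0.0
    # Pre-rotate by +-90 degrees so that the vector lies in the right half-plane.
    if x < 0:
        if y >= 0:
            x, y, z = y, -x, math.pi / 2
        else:
            x, y, z = -y, x, -math.pi / 2
    gain = 1.0
    for k in range(iterations):
        t = 2.0 ** -k
        d = -1.0 if y > 0 else 1.0  # rotate toward the X axis
        x, y, z = x - d * y * t, y + d * x * t, z - d * math.atan(t)
        gain *= math.sqrt(1.0 + t * t)  # each micro-rotation stretches the vector
    return x / gain, z  # R (AM portion) and theta (PM/FM portion, Z register)
```

For an input pair such as `(R*cos(theta), R*sin(theta))` the function returns values close to `R` and `theta`, which is the separation into AM and PM/FM portions described above.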

A difficult part of demodulation is I/Q generation, and typically two meth- 
ods are used: a quadrature scheme and a Hilbert transform. 

In the quadrature scheme the input signal is multiplied by the two mixer 
signals, $2\cos(2\pi f_m t)$ and $-j2\sin(2\pi f_m t)$. If the signals in the IF band 
$f_{\rm IF} = f_0 - f_m$ are now selected with a filter, the complex sum of these 
signals is then a reconstruction of the complex rotating arrow. Figure 7.38 
shows this scheme while Fig. 7.39 displays an example of the I/Q generation. 
From the spectra shown in Fig. 7.39 it can be seen that the final signal has 
no negative spectral components. This is typical for this type of incoherent 
receiver and these signals are called analytic. 

Fig. 7.38. Generation of I- and Q-phase using quadrature scheme. 

Fig. 7.39. Spectral example of the I/Q generation. 

Fig. 7.40. Spectra for the zero IF receiver. Sampling frequency was $2\pi$. 

To decrease the effort for the filters, it is desirable to have an IF 
frequency close to zero. In an analog scheme (especially for AM) this often 
introduces a new problem, that the amplifier drifts into saturation. But for a 
fully digital receiver, such a homodyne or zero IF receiver can be built. The 
bandpass filters then reduce to lowpass filters. Hogenauer’s CIC filters (see 
Chap. 5, p. 187) are efficient realizations of these high decimation filters. 
Figure 7.40 shows the corresponding spectra. The real input signal is sampled 
at $2\pi$. Then the signal is multiplied with a cosine signal $S_{2\cos}(e^{j\omega})$ and 
a sine signal $S_{-j2\sin}(e^{j\omega})$. This produces the in-phase component 
$S_I(e^{j\omega})$ and the quadrature component $jS_Q(e^{j\omega})$. These two signals are 
now combined into a complex analytic signal $S_I + jS_Q$. After the final 
lowpass filtering, a decimation in sampling rate can be applied. 
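The zero IF principle can be tried out with a few lines of Python. The sketch below is an illustration only: it replaces the CIC lowpass by a plain average over an integer number of carrier periods (which also acts as the decimator), and the function names `carrier` and `zero_if_iq` are hypothetical.

```python
import math


def carrier(amplitude, phi0, f0, fs, count):
    """Real test signal A*cos(2*pi*f0*n/fs + phi0) with unknown phase phi0."""
    return [amplitude * math.cos(2 * math.pi * f0 * n / fs + phi0)
            for n in range(count)]


def zero_if_iq(samples, f0, fs):
    """Mix with 2cos and -2sin at the carrier frequency, then average.

    The average stands in for the lowpass/decimation filter; it removes the
    double-frequency term exactly when the block spans whole carrier periods.
    """
    i_sum = q_sum = 0.0
    for n, s in enumerate(samples):
        w = 2 * math.pi * f0 * n / fs
        i_sum += s * 2 * math.cos(w)
        q_sum += s * -2 * math.sin(w)
    return i_sum / len(samples), q_sum / len(samples)
```

With eight-times oversampling and a block of 64 samples, the pair (I, Q) equals (A cos φ0, A sin φ0), so the magnitude recovers the amplitude A without knowledge of the phase.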

Such a fully digital zero IF receiver for LF has been built using FPGA 
technology [189]. 

Example 7.15: Zero IF Receiver 

This receiver has an antenna, a programmable gain adjust (AGC), and a 
seventh-order Cauer lowpass followed by an 8-bit video A/D converter. The 
receiver uses eight-times oversampling (0.4-1.2 MHz) for the input range from 
50 to 150 kHz. The quadrature multipliers are 8 x 8-bit array multipliers. 
Two-stage CIC filters were designed with 24- and 19-bit integrator precision, 
and 17- and 16-bit precision for the comb sections. The final sampling rate 
reduction was 64. The full design could fit on a single XC3090 Xilinx FPGA. 
The following table shows the effort for the single units: 

Design part                          CLBs 

Mixer with sin/cos tables              74 
Two CIC filters                       168 
State machine and PDSP interface       18 
Frequency synthesizer                  32 

Total                                 292 

Fig. 7.41. PLL with accumulator as reference. 

[Figure 7.42: two screen captures, (a) and (b), from an HP 5371A Frequency 
and Time Interval Analyzer.] 

Fig. 7.42. PLL synthesizers with accumulator reference. (a) Behavior of the 
synthesizer for switching $F_{\rm out}$ from 900 kHz to 1.2 MHz. (b) Histogram of the 
frequency error, which is less than 2 Hz. 


For the tunable frequency synthesizer an accumulator as reference for an 
analog phase-locked loop (PLL) was used [4]. Figure 7.41 shows this type 
of frequency synthesizer and Fig. 7.42 displays the measured performance of 
the synthesizer. The accumulator synthesizer could be clocked at a very high 
rate because only the overflow is needed. A bitwise carry save adder was 
therefore used. The accumulator was used as a reference for the PLL that 
produces $F_{\rm out} = M_2 F_{\rm in}^{*} = M_1 M_2 F_{\rm in}/2^N$. | 7.15 | 

Fig. 7.43. Hilbert transformer. (a) Filter. (b) Spectrum of H(f). 



The Hilbert transformer scheme relies on the fact that a sine signal can 
be computed from the cosine signal by a phase delay of 90°. If a filter is used 
to produce this Hilbert transformer, the amplitude of the filter must be one 
and the phase must be 90° for all frequencies. Impulse response and transfer 
function can be found using the definition of the Fourier transform, i.e., 

$$h(t) = \frac{1}{\pi t} \;\longleftrightarrow\; H(j\omega) = -j\gamma(\omega) = \begin{cases} \;\;\,j & -\infty < \omega < 0 \\ -j & \;\;\;\,0 < \omega < \infty \end{cases} \qquad (7.60)$$

with $\gamma(\omega) = -1\ \forall\ \omega < 0$ and $\gamma(\omega) = 1\ \forall\ \omega > 0$ as the sign function. 
A Hilbert filter can only be approximated by an FIR filter and resulting 
coefficients have been reported (see for instance [190, 191], 
[122, pp. 168-174], or [67, p. 681]). 

Simplification for narrowband receivers. If the input signals are 
narrowband signals, i.e., the transmitted bit rate is much smaller than the 
carrier frequency, some simplifications in the demodulation scheme are 
possible. In the input sampling scheme it is then possible to sample at the 
carrier rate, or at a multiple of the period $T_0 = 1/f_0$ of the carrier, in 
order to ensure that the sampled signals are already free of the carrier 
component. 

The quadrature scheme becomes trivial if the zero IF receiver samples 
at $4f_0$. In this case, the sine and cosine components are elements of 0, 1, 
or -1, and the carrier phase is 0°, 90°, 180°, .... This is sometimes 
referred to as “complex sampling” in the literature [192, 193]. It is possible 
to use undersampling, i.e., only every second or third carrier period is 
evaluated by the sampler. Then the sampled signal will still be free from the 
carrier frequency. 
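A short Python sketch shows why sampling at $4f_0$ makes the quadrature mixer trivial: the mixer sequences contain only 0, 1, and -1, so the products reduce to sign changes and omitted samples. The function name `iq_from_4f0`, the scaling by the number of periods, and the sign convention for Q are assumptions made for this illustration.

```python
import math

# cos(pi*n/2) and sin(pi*n/2) for n = 0, 1, 2, 3: only 0, 1, and -1 occur.
COS4 = (1, 0, -1, 0)
SIN4 = (0, 1, 0, -1)


def iq_from_4f0(samples):
    """I and Q from samples taken at four times the carrier frequency.

    The block length must be a multiple of four (whole carrier periods).
    """
    if len(samples) % 4:
        raise ValueError("need a whole number of carrier periods")
    i_acc = q_acc = 0.0
    for n, s in enumerate(samples):
        i_acc += s * COS4[n % 4]  # +s, 0, -s, 0: no multiplier required
        q_acc -= s * SIN4[n % 4]  # 0, -s, 0, +s
    periods = len(samples) // 4
    # Two nonzero terms per period contribute to each accumulator.
    return i_acc / (2 * periods), q_acc / (2 * periods)
```

For a carrier A cos(πn/2 + φ) the function returns (A cos φ, A sin φ).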

The Hilbert transformer can also be simplified if the signal is sampled 
at $T_0/4$. A Hilbert sampler of first order with $Q_{-1} = 1$ or a second-order 
type using the symmetric coefficients $Q_1 = -0.5$; $Q_{-1} = 0.5$ or asymmetric 
$Q_{-1} = 1.5$; $Q_{-3} = 0.5$ coefficients can be used [194]. 

Table 7.19. Coefficients of the Hilbert sampler. 

Type                      Coefficients                      Bit   $\Delta f/f_0$ 

Zero order                Q_{-1} = 1.0                        8   0.005069 
                          Q_{-1} = 1.0                       12   0.000320 
                          Q_{-1} = 1.0                       16   0.000020 

First-order asymmetric    Q_{-1} = 1.5; Q_{-3} = 0.5          8   0.032805 
                          Q_{-1} = 1.5; Q_{-3} = 0.5         12   0.008238 
                          Q_{-1} = 1.5; Q_{-3} = 0.5         16   0.002069 

First-order symmetric     Q_1 = -0.5; Q_{-1} = 0.5            8   0.056825 
                          Q_1 = -0.5; Q_{-1} = 0.5           12   0.014269 
                          Q_1 = -0.5; Q_{-1} = 0.5           16   0.003584 

[Figure 7.44: two panels, (a) and (b); recoverable block labels: S&H, MPX.] 

Fig. 7.44. (a) Two versions of the Hilbert sampler of first order. 



Table 7.19 reports the three short-term Hilbert transformer coefficients 
and the maximum allowed frequency offset $\Delta f$ of the modulation, for the 
Hilbert filter providing a specified accuracy. 

Figure 7.44 shows two possible realizations for the Hilbert transformer 
that have been used to demodulate radio-controlled watch signals [195, 196]. 
The first method uses three sample-and-hold circuits and the second method 
uses three A/D converters to build a symmetric Hilbert sampler of first order. 

Figure 7.45 shows the spectral behavior of the Hilbert sampler with a 
direct undersampling by two. 

7.3.3 Coherent Demodulation 

Fig. 7.45. Spectra for the Hilbert sampler with undersampling. 

Fig. 7.46. Phase-locked loop (PLL) with (necessary) bandpass ($h_{\rm BP}(t)$), 
phase detector (PD), lowpass ($h_{\rm LP}(t)$), and voltage-controlled oscillator 
(VCO). 

If the phase $\phi_0$ of the receiver is known, then demodulation can be 
accomplished by multiplication and lowpass filtering. For AM, the received 
signal $s(t)$ is multiplied by $2\cos(\omega_0 t + \phi_0)$ and for PM or FM by 
$-2\sin(\omega_0 t + \phi_0)$. It follows that 

AM: 

$$A(t)\cos(2\pi f_0 t + \phi_0) \times 2\cos(2\pi f_0 t + \phi_0)$$
$$= \underbrace{A(t)}_{\text{lowpass component}} + A(t)\cos(4\pi f_0 t + 2\phi_0) \qquad (7.61)$$
$$s_{\rm AM}(t) = A(t) - A_0. \qquad (7.62)$$

PM: 

$$-2\sin(2\pi f_0 t + \phi_0) \times \cos(2\pi f_0 t + \phi_0 + \Delta\phi(t)) \qquad (7.63)$$
$$= \underbrace{\sin(\Delta\phi(t))}_{\text{lowpass component}} - \sin(4\pi f_0 t + 2\phi_0 + \Delta\phi(t)) \qquad (7.64)$$
$$s_{\rm PM}(t) = \frac{1}{\eta}\sin(\Delta\phi(t)) \approx \frac{1}{\eta}\Delta\phi(t). \qquad (7.65)$$

FM: 

$$s_{\rm FM}(t) = \frac{1}{\eta}\frac{d\,\Delta\phi(t)}{dt}, \qquad (7.66)$$

where $\eta$ is the so-called modulation index. 
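The role of the known phase in the AM case can be made concrete with a small numerical experiment. The Python sketch below is an illustration under simplifying assumptions: the amplitude is constant over the block, the lowpass is a plain average over whole carrier periods, and the name `coherent_am` is hypothetical. If the receiver phase differs from the carrier phase by Δ, the averaged product is A cos Δ, which follows from the same product-to-sum identity that leads to (7.61).

```python
import math


def coherent_am(samples, f0, fs, phi_rx):
    """Multiply by 2cos(2*pi*f0*n/fs + phi_rx) and average (crude lowpass)."""
    acc = 0.0
    for n, s in enumerate(samples):
        acc += s * 2 * math.cos(2 * math.pi * f0 * n / fs + phi_rx)
    return acc / len(samples)


def am_carrier(amplitude, phi0, f0, fs, count):
    """Test signal A*cos(2*pi*f0*n/fs + phi0) with constant amplitude A."""
    return [amplitude * math.cos(2 * math.pi * f0 * n / fs + phi0)
            for n in range(count)]
```

With the correct phase the full amplitude is recovered; a phase error of 60 degrees halves the output, which is why a coherent receiver needs a loop that tracks the carrier phase.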

In the following we will discuss the types of coherent receivers that are 
suitable for an FPGA implementation. Typically, coherent receivers provide 
a 1 dB better signal-to-noise ratio than an incoherent receiver (see Fig. 7.4, 
p. 306). A synchronous or coherent FM receiver tracks the carrier phase of 
the incoming signal with a voltage-controlled oscillator (VCO) in a loop. 
The DC part of this voltage is directly proportional to the FM signal. PM 
signal demodulation requires integration of the VCO control signal, and AM 
demodulation requires the addition of a second mixer and a $\pi/2$ phase shifter 
in sequence with a lowpass filter. The risk with coherent demodulation is 
that for a channel with a low signal-to-noise ratio, the loops may be 
out-of-lock, and performance will decrease tremendously. 

There are two common types of coherent receiver loops: the phase-locked 
loop (PLL) and the Costas loop (CL). Figures 7.46 and 7.47 are block dia- 
grams of a PLL and CL, respectively, showing the nearly doubled complex- 
ity of the CL. Each loop may be realized as an analog (linear PLL/CL) or 
all-digital (ADPLL, ADCL) circuit (see [197, 198, 199, 200]). The stability 
analysis of these loops is beyond the scope of the book and is well covered in 
the literature ([201, 202, 203, 204, 205]). We will discuss efficient realizations 
of PLLs and CLs [206, 207]. The first PLL is a direct translation of an analog 
PLL to FPGA technology. 

Linear phase-locked loop. The difference between linear and digital loops 
lies in the type of input signal to be processed. A linear PLL or CL uses a fast 
multiplier as a phase detector, providing a possibly multilevel input signal to 
the loop. A digital PLL or CL can process only binary input signals. (Digital 
refers to the quality of the input signal here, not to the hardware realization!) 

As shown in Fig. 7.46, the linear PLL has three main blocks: 

• Multiplier as phase detector 

• Loop filter 

• VCO 

Fig. 7.47. Costas loop with (necessary) bandpass ($h_{\rm BP}(t)$), three phase 
detectors (PD), three lowpass filters ($h_{\rm LP}(t)$), and a voltage-controlled 
oscillator (VCO) with $\pi/2$ phase shifter. 

To keep the loop in-lock, a loop output signal-to-noise ratio larger than 
4 = 6 dB is required [200, p. 35]. Since the selectivity of typical antennas 
is not narrow enough to achieve this, an additional narrow bandpass filter 
has been added to Figs. 7.46 and 7.47 as a necessary addition to the 
demodulator [207]. The “cascaded bandpass comb” filter (see Table 5.4, p. 209) 
is an efficient example. However, the filter design is much easier if a fixed 
IF is used, as in the case of a superhet or double superhet receiver. 

The VCO (or digitally controlled oscillator (DCO) for ADPLLs) oscillates 
with a frequency $\omega_2 = \omega_0 + K_0 \times U_f(t)$, where $\omega_0$ is the resting point 
and $K_0$ the gain of the VCO/DCO. For sinusoidal input signals we have the 
signal 

$$U_{\rm dem}(t) = K_d \sin(\Delta\phi(t)) \qquad (7.67)$$

at the output of the lowpass, where $\Delta\phi(t)$ is the phase difference between 
the DCO output and the bandpass-filtered input signal. For small differences, 
the sine can be approximated by its argument, giving $U_{\rm dem}(t)$ proportional 
to $\Delta\phi(t)$ (the loop stays in-lock). If the input signal has a very sudden 
phase discontinuity, the loop will go out-of-lock. Figure 7.48 shows the 
different operation areas of the loop. The hold-in range $\omega_0 \pm \Delta\omega_H$ is the 
static operation limit (useful only with a frequency synthesizer). The lock-in 
range is the area where the PLL will lock-in within a single period of the 
frequency difference $\omega_1 - \omega_2$. Within the pull-in range, the loop will 
lock-in within the capture time $T_L$, which may last more than one period of 
$\omega_1 - \omega_2$. The pull-out range is the maximum frequency jump the loop can 
sustain without going out-of-lock. $\omega_0 \pm \Delta\omega_{PO}$ is the dynamic operation 
limit used in demodulation. There is much literature on optimizing the PD, 
the loop filter, and the VCO gain; see [201, 202, 203, 204, 205]. 

[Figure 7.48: panel (a) marks the static stability limit (hold-in range), 
the dynamic stability limit, and the pull-out range; panel (b) marks the 
static/dynamic stability limit, the Costas range, and the sender range.] 

Fig. 7.48. (a) Operation area PLL/CL. (b) Operation area of the CL of Fig. 7.51. 

Table 7.20. Cost in CLBs of a linear PLL universal demodulator in a Xilinx 
XC3000 FPGA. 

Function group                          FM only   FM, AM, and PM 

Phase detector (8 x 8-bit multiplier)      65          72 
Loop filter (two-stage CIC)                84         168 
Frequency synthesizer                      34          34 
DCO (N/N+K divider, sin/cos table)         16          16+2 
PDSP interface                             15          15 

Total                                     214         307 

The major advantage of the linear loop over the digital PLL and CL is its 
noise-reduction capability, but the fast multiplier used for the PD in a linear 
PLL has a particularly high hardware cost, and this PD has an unstable rest 
point at $\pi/2$ phase shift ($-\pi/2$ is a stable point), which impedes lock-in. 
Table 7.20 estimates the hardware cost in CLBs of a linear PLL, using the same 
functional blocks as the 8-bit incoherent receiver (see Example 7.15, p. 348). 
Using a hardware multiplier in multiplex for AM and PM demodulation, the 
right-hand column reduces the circuit’s cost by 58 CLBs, allowing it to fit 
into a 320 CLB Xilinx XC3090 device. 

A comparison of these costs to those of an incoherent receiver, consuming 
292 CLBs without CORDIC demodulation and 367.5 CLBs with an addi- 
tional CORDIC processor [59], shows a slight improvement in the linear PLL 
realization. If only FM and PM demodulation are required, a digital PLL or 
CL, described in the next two sections, can reduce complexity dramatically. 

These designs were developed to demodulate WeatherFAX pictures, which 
were transmitted in central Europe by the low-frequency radio stations 
DCF37 and DCF54 (carriers at 117.4 kHz and 134.2 kHz; frequency modulation 
F1C: ±150 Hz). 

Fig. 7.49. Phase detector [18, Chap. 8 p. 127]. 



Digital PLLs. As explained in the last section, a digital PLL works with 
binary input signals. Phase detectors for a digital PLL are simpler than the 
fast multipliers used for linear PLLs; usual choices are XOR gates, edge- 
triggered JK flip-flops, or paired RS flip-flops with some additional gates 
[200, pp. 60-65]. The phase detector shown in Fig. 7.49 is the most complex, 
but it provides phase and frequency sensitivity and a quasi-infinite hold-in 
range. 

Modified counters are used as loop filters for DPLLs. These may be 
N/(N+K) counters or multistage counters, such as an N-by-M divider, 
where separate UP and DOWN counters are used, and a third counter 
measures the UP/DOWN difference. It is then possible to break off further 
signal processing if a certain threshold is not met. For the DCO, any typical 
all-digital frequency synthesizer, such as an accumulator, divider, or 
multiplicative generator may be used. The most frequently used synthesizer is 
the tunable divider, popular because of its low phase error. The low 
resolution of this synthesizer can be improved by using a low receiver IF [208]. 

One DPLL realization with very low complexity is the 74LS297 circuit, 
which utilizes a “pulse-stealing” design. This scheme may be improved with 
the phase- and frequency-sensitive J-K flip-flop, as displayed in Fig. 7.50. The 
PLL works as follows: the “detect flip-flop” runs with rest frequency 

$$\frac{F_{\rm in}}{N} = \frac{F_{\rm osc}}{KM}. \qquad (7.68)$$

To allow tracking of incoming frequencies higher than the rest frequency, 
the oscillator frequency $F_{\rm osc+}$ is set slightly higher: 

$$T_{\rm comp} - T_{\rm comp+} = \frac{1}{2} T_{\rm osc}, \qquad (7.69)$$

such that the signal at point B oscillates half a period faster than the 
$F_{\rm osc}$ signal. After approximately two periods at the rest frequency, a one 
will be latched in the detector flip-flop. This signal runs through the 
deglitch and delay flip-flops, and then inhibits one pulse of the ÷K divider 
(thus the name “pulse-stealing”). This delays the signal at B such that the 
phase of signal A runs after B, and the cycle repeats. The lock-in range of 
the PLL has a lower bound of $F_{\rm in}|_{\min} = 0$ Hz. The upper bound depends on the 
maximum output frequency $F_{\rm osc+}/K$, so the lock-in range becomes 

$$\pm\Delta\omega_L = \pm N \times F_{\rm osc+}/(K \times M). \qquad (7.70)$$

Fig. 7.50. “Pulse-stealing” PLL [209]. 

Table 7.21. Hardware complexity of a pulse-stealing DPLL [208]. 

Function group            CLBs 

DCO                         16 
Phase detector               5 
Loop filter                  6 
Averaging                   11 
PC-interface                10 
Frequency synthesizer       26 
State machine               10 

Total                       84 

Fig. 7.51. Structure of the Costas loop. 

A receiver can be simplified by leaving out the counters N and M. In 
a WeatherFAX image-decoding application the second IF of the 
double-superhet receiver is set in such a way that the frequency modulation of 
300 Hz ($\delta f = 0$ Hz → white; $\delta f = 300$ Hz → black) corresponds to exactly 32 
steal pulses, so that the steal pulses correspond directly to the grayscale 
level. We set a pixel rate of 1920 Baud, and an IF of 16.6 kHz. For each pixel, 
four “steal values” (number of steal pulses in an interval) are determined, so 
a total of $\log_2(2 \times 4) = 3$ bit shifts are used to compute the 16 gray-level 
values. Table 7.21 shows the hardware complexity of this PLL type. 
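The gray-level computation described above amounts to one addition tree and a shift. The Python sketch below is a hypothetical model, not the original circuit: it assumes that each steal value lies between 0 and 32, and it saturates the result at 15 so that exactly 16 levels remain, which the text does not state explicitly.

```python
def gray_level(steal_values):
    """Combine four steal values (assumed 0..32 each) into a 4-bit gray level."""
    if len(steal_values) != 4:
        raise ValueError("four steal values per pixel are expected")
    total = sum(steal_values)   # 0..128 for four values of up to 32 pulses
    level = total >> 3          # log2(2 * 4) = 3 bit shifts
    return min(level, 15)       # saturation is an assumption of this sketch
```

A pixel with 16 steal pulses in each of its four intervals, for instance, maps to level 8.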

Costas loop. This extended type of coherent loop was first proposed by 
John P. Costas in 1956, who used the loop for carrier recovery. As shown in 
Fig. 7.47, the CL has an in-phase and a quadrature path (subscripted I and 
Q there). With the $\pi/2$ phase shifter and the third PD and lowpass, the CL 
is approximately twice as complex as the PLL, but locks onto a signal twice 
as fast. Costas loops are very sensitive to small differences between in-phase 
and quadrature gain, and should therefore always be realized as all-digital 
circuits. The FPGA seems to be an ideal realization vehicle ([210, 211]). 

For a signal $U(t) = A(t)\sin(\omega_0 t + \Delta\phi(t))$ we get, after the mixer and 
lowpass filters, 

$$U_I(t) = K_d A(t) \cos(\Delta\phi(t)) \qquad (7.71)$$

$$U_Q(t) = K_d A(t) \sin(\Delta\phi(t)), \qquad (7.72)$$

where $2K_d$ is the gain of the PD. $U_I(t)$ and $U_Q(t)$ are then multiplied 
together in a third PD, and are lowpass filtered to get the DCO control signal: 

$$U_{I \times Q}(t) \sim K_d \sin(2\Delta\phi(t)). \qquad (7.73)$$
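The doubled argument in (7.73) follows directly from the product of (7.71) and (7.72) and the double-angle identity. The short derivation below is a worked step, with the amplitude-dependent scale factor kept explicit; in (7.73) this factor is absorbed into the proportionality.

```latex
\begin{align}
U_I(t)\,U_Q(t) &= K_d^2 A^2(t)\,\cos(\Delta\phi(t))\,\sin(\Delta\phi(t))
                = \tfrac{1}{2} K_d^2 A^2(t)\,\sin(2\Delta\phi(t)), \\
\sin(2\Delta\phi) &\approx 2\Delta\phi
   \quad\text{whereas}\quad
   \sin(\Delta\phi) \approx \Delta\phi
   \qquad (|\Delta\phi| \ll 1).
\end{align}
```

For small phase errors the Costas loop control signal therefore rises twice as steeply with the phase error as the PLL signal of (7.67).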

A comparison of (7.67) and (7.73) shows that, for small modulations, the 
slope of the control signal $U_{I \times Q}(t)$ is twice the PLL’s. As in the PLL, if 
only FM or PM demodulation is needed, the PDs may be all digital. 

Table 7.22. Loop filter output and DCO correction values at 32-times 
oversampling. 

Accumulator              Sum s                        DCO-IN   Gray    f_carrier ± δf 
Underflow   Overflow                                           value 

Yes         No           s < -(2^13 - 1)                  3       0     +180 Hz 
No          No           -(2^13 - 1) < s < -2048          2       0     +120 Hz 
No          No           -2048 < s < -512                 1       4      +60 Hz 
No          No           -512 < s < 512                   0       8       +0 Hz 
No          No           512 < s < 2048                  -1      12      -60 Hz 
No          No           2048 < s < 2^13 - 1             -2      15     -120 Hz 
No          Yes          s > 2^13 - 1                    -3      15     -180 Hz 

Figure 7.51 shows a block diagram of a CL. The antenna signal is first 
filtered and amplified by a fourth-order Butterworth bandpass, then digitized 
by an 8-bit converter at a sampling rate 32 or 64 times the carrier base 
frequency. The resulting signal is split, and fed into a zero-crossing detector 
and a minimum/maximum detector. Two phase detectors compare the signals 
with a reference signal, and its $\pi/2$-shifted counterpart, synthesized by a 
high time-constant PLL with a reference accumulator [4, section 2]. Each phase 
detector has two edge detectors, which should generate a total of 4 UP and 
4 DOWN signals. If more UP signals than DOWN are generated by the 
PDs, then the reference frequency is too low, and if more DOWN signals are 
generated, it is too high. The differences $\sum$(UP - DOWN) are accumulated 
for one pixel duration in a 13-bit accumulator acting as a loop filter. The loop 
filter data are passed to a pixel converter, which gives “correction values” to 
the DCO as shown in Table 7.22. The accumulated sums are also used as 
grayscale values for the pixel, and passed onto a PC to store and display the 
WeatherFAX pictures. 
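The pixel converter is essentially a lookup on the accumulated sum. The Python sketch below transcribes the rows of Table 7.22; the treatment of sums that fall exactly on a threshold, the modeling of underflow and overflow as values beyond ±(2^13 - 1), and the names `pixel_converter` and `frequency_shift_hz` are assumptions of this illustration.

```python
LIMIT = 2 ** 13 - 1  # largest magnitude of the 13-bit loop filter accumulator


def pixel_converter(s):
    """Map the accumulated UP-DOWN sum s to (DCO correction, gray value)."""
    if s < -LIMIT:       # accumulator underflow
        return 3, 0
    if s < -2048:
        return 2, 0
    if s < -512:
        return 1, 4
    if s < 512:
        return 0, 8
    if s < 2048:
        return -1, 12
    if s < LIMIT:
        return -2, 15
    return -3, 15        # accumulator overflow


def frequency_shift_hz(s):
    """Frequency step applied to the DCO, in steps of 60 Hz as in Table 7.22."""
    correction, _ = pixel_converter(s)
    return 60 * correction
```

A balanced pixel with s = 0, for example, leaves the DCO unchanged and yields the mid-gray value 8.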

The smallest detectable phase offset for a 2 kBaud pixel rate is 

$$f_{\rm carrier+1} = \frac{1}{1/f_{\rm carrier} - t_{\rm ph37} \times 2\,{\rm kBaud}/f_{\rm carrier}} = 117.46\ {\rm kHz}, \qquad (7.74)$$

where $t_{\rm ph37} = 1/(32 \times 117\ {\rm kHz}) = 266$ ns is the sampling period at 32-times 
oversampling. The frequency resolution is 117.46 kHz - ($f_{\rm carrier}$ = 117.4 kHz) 
= 60 Hz. With a frequency modulation of 300 Hz, five grayscale values can 
be distinguished. Higher sampling rates for the accumulator are not possible 
with the 3164-4ns FPGA and the limited A/D converter used. With a fast 
A/D converter and an Altera Flex or Xilinx XC4K FPGA, 128- and 256-times 
oversampling are possible. 
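The numbers in (7.74) can be reproduced with a few lines of Python. This is a plain evaluation of the formula with the values given in the text; the function name `next_resolvable_frequency` is hypothetical.

```python
def next_resolvable_frequency(f_carrier, t_sample, pixel_rate):
    """Evaluate (7.74): carrier period shortened by one sample per pixel."""
    return 1.0 / (1.0 / f_carrier - t_sample * pixel_rate / f_carrier)


f_carrier = 117.4e3                  # DCF37 carrier in Hz
t_ph37 = 1.0 / (32 * 117e3)          # sampling period at 32-times oversampling
f_next = next_resolvable_frequency(f_carrier, t_ph37, 2e3)
resolution = f_next - f_carrier      # roughly 60 Hz
gray_steps = int(300 // 60)          # steps of 60 Hz within the 300 Hz deviation
```

The computed step is slightly above 60 Hz, which the text rounds to 60 Hz, giving five distinguishable steps within the 300 Hz frequency modulation.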

For a maximum phase offset of $\pi$, the loop will require a maximum lock-in 
time $T_L$ of ⌈16/3⌉ = 6 samples, or about 1.5 μs. Table 7.23 shows the 
complexity of the CL for 32 and 64 times oversampling. 

Table 7.23. Complexity of a fully digital Costas loop [207, p. 60]. 

Function group            CLBs with oversampling 
                          32 times    64 times 

Frequency synthesizer        33          36 
Zero detection               42          42 
Maximum detection            16          16 
Four phase detectors          8           8 
Loop filter                  51          51 
DCO                          12          15 
TMS interface                11          11 

Sum                         173         179 



Exercises 

7.1: The following MatLab code can be used to compute the order of an element. 

function N = order(x,M) 
% Compute the order of x modulo M 
p = x; l = 1; 
while p ~= 1 
  l = l+1; p = p * x; 
  re = real(p); im = imag(p); 
  p = mod(re,M) + i * mod(im,M); 
end; 
N = l; 

If, for instance, the function is called with order(2,2^25+1) the result is 50. 
To compute the single factors of $2^{25} + 1$, the standard MatLab function 
factor(2^25+1) can be used. 

For 

(a) $\alpha = 2$ and $M = 2^{41} + 1$ 

(b) $\alpha = -2$ and $M = 2^{29} - 1$ 

(c) $\alpha = 1 + j$ and $M = 2^{29} + 1$ 

(d) $\alpha = 1 + j$ and $M = 2^{26} - 1$ 

compute the transform length, the “bad” factors $\nu$ (i.e., order not equal 
order($\alpha$, $2^B \pm 1$)), all “good” prime factors $M/\nu$, and the available input 
bit width $B_x = (\log_2(M/\nu) - \log_2(L))/2$. 

7.2: To compute the inverse $r^{-1} \bmod M$ for gcd(r, M) = 1 of the value r, we 
can use the fact that the following diophantic equation holds: 

$$\gcd(r, M) = u \times r + v \times M \quad \text{with} \quad u, v \in \mathbb{Z}. \qquad (7.75)$$

(a) Explain how to use the MatLab function [g u v]=gcd(x,M) to compute the 
multiplicative inverse. 

Compute the following multiplicative inverses if possible: 

(b) $3^{-1} \bmod 73$; 

(c) $64^{-1} \bmod 2^{32} + 1$; 

(d) $31^{-1} \bmod 2^{31} - 1$; 

(e) $89^{-1} \bmod 2^{11} - 1$; 

(f) $641^{-1} \bmod 2^{32} + 1$. 



7.3: The following MatLab code can be used to compute Fermat NTTs for length 
2, 4, 8, and 16 modulo 257. 

function Y = ntt(x) 
% Compute Fermat NTT of length 2,4,8 and 16 modulo 257 
l = length(x); 
switch (l) 
  case 2, alpha=-1; 
  case 4, alpha=16; 
  case 8, alpha=4; 
  case 16, alpha=2; 
  otherwise, disp('NTT length not supported') 
end 
A=ones(l,l); A(2,2)=alpha; 
%*********Computing second column 
for m=3:l 
  A(m,2)=mod(A(m-1,2)*alpha,257); 
end 
%*********Computing rest of matrix 
for m=2:l 
  for n=2:l-1 
    A(m,n+1)=mod(A(m,n)*A(m,2),257); 
  end 
end 
%*********Computing NTT A*x 
for k = 1:l 
  C1 = 0; 
  for j = 1:l 
    C1 = C1 + A(k,j) * x(j); 
  end 
  X(k) = mod(C1,257); 
end 
Y=X; 

(a) Compute the NTT X of x = {1, 1, 1, 1, 0, 0, 0, 0}. 

(b) Write the code for the appropriate INTT. Compute the INTT of X from part 
(a). 

(c) Compute the element-by-element product $Y = X \odot X$ and INTT(Y) = y. 

(d) Extend the code for a complex Fermat NTT and INTT for $\alpha = 1 + j$. Test 
your program with the identity x = INTT(NTT(x)). 



7.4: The Walsh transform for N = 4 is given by: 

$$W_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}$$

(a) Compute the scalar product of the row vectors. What property does the 
matrix have? 

(b) Use the results from (a) to compute the inverse $W_4^{-1}$. 

(c) Compute the 8 x 8 Walsh matrix $W_8$, by scaling the original row vector by 
two (i.e., h[n/2]) and computing an additional two “children” h[n] + h[n - 4] 
and h[n] - h[n - 4] from row 3 and 4. There should be no zero in the resulting 
$W_8$ matrix. 

(d) Draw a function tree to construct Walsh matrices of higher order. 



7.5: The Hadamard matrix can be computed using the following iteration 

$$H_{2^{l+1}} = \begin{bmatrix} H_{2^l} & H_{2^l} \\ H_{2^l} & -H_{2^l} \end{bmatrix}, \qquad (7.76)$$

with $H_1 = [1]$. 

(a) Compute $H_4$ and $H_8$. 

(b) Find the appropriate index for the rows in $H_4$ and $H_8$, compared with the 
Walsh matrix $W_4$ and $W_8$ from Exercise 7.4. 

(c) Determine the general rule to map a Walsh matrix into a Hadamard matrix. 
Hint: First compute the index in binary notation. 



7.6: The following MatLab code can be used to compute the state-space 
description for $p_{14} = x^{14} + x^5 + x^3 + x^1 + 1$, the nonzero elements using nnz, 
and the maximum fan-in. 

p = input('Please define power of matrix = ') 
A=zeros(14,14); 
for m=1:13 
  A(m,m+1)=1; 
end 
A(14,14)=1; 
A(14,13)=1; 
A(14,11)=1; 
A(14,9)=1; 
Ap=mod(A^p,2); 
nnz(Ap) 
max(sum(Ap,2)) 

(a) Compute the number of nonzero elements and fan-in for p = 2 to 8. 

(b) Modify the code to compute the twin $p_{14} = x^{14} + x^{13} + x^{11} + x^9 + 1$. 
Compute the number of nonzero elements for the modified polynomial for 
p = 2 to 8. 

(c) Modify the original code to compute the alternative LFSR implementation 
(see Fig. 7.21, p. 326) for (a) and (b) and compute the nonzero elements for 
p = 2 to 8. 



Exercises Using MaxPlusII 

7.7: (a) Compile the code for the length-6 LFSR lfsr.vhd from Example 7.12 
(p. 325) using MaxPlusII. 

(b) For the line 

ff(1) <= NOT (ff(5) XOR ff(6)); 

substitute 

ff(1) <= ff(5) XNOR ff(6); 

and compile with MaxPlusII. 

(c) Now change the Compiler settings Interfaces → VHDL Netlist Reader 
Settings from VHDL 1987 to VHDL 1993 and compile again. Explain the results. 



The filters we have discussed so far were designed for applications where 
the requirements for the “optimal” coefficients did not change over time, i.e., 
they were LTI systems. However, many real-world signals we find in typical 
DSP fields like speech processing, communications, radar, sonar, seismology, 
or biomedicine require that the “optimal” filter or system coefficients be 
adjusted over time depending on the input signal. If the parameter changes 
slowly compared with the sampling frequency, we can compute a “better” 
estimation for our optimal coefficients and adjust the filter appropriately. 

In general, any filter structure, FIR or IIR, with the many architectural 
variations we have discussed before, may be used as an adaptive digital filter 
(ADF). Comparing the different structural options, we note that 

• For FIR filters the direct form from Fig. 3.1 (p. 110) seems to be 
advantageous because the coefficient update can be done at the same time 
instance for all coefficients. 

• For IIR filters the lattice structure shown in Fig. 4.12 (p. 159) seems to 
be a good choice because lattice filters possess a low fixed-point arithmetic 
roundoff error sensitivity and a simplified stability control of the 
coefficients. 

From the published literature, however, it appears that FIR filters have been 
used more successfully than IIR filters and our focus in this chapter will 
therefore be efficient and fast implementation of adaptive FIR filters. 

The FIR filter algorithms should converge to the optimum nonrecursive es- 
timator solution given (originally for continuous signal) through the Wiener- 
Hopf equation [212]. We will then discuss the optimum recursive estimator 
(Kalman filter). We will compare the different options in terms of compu- 
tational complexity, stability of the algorithms, initial speed of convergence, 
consistency of convergence, and robustness to additive noise. 

Adaptive filters can now be seen as a mature DSP field. Many books 
were published in their first edition in the mid-1980s and can be used 
for a more in-depth study [213, 214, 215, 216, 217, 218]. More recent results 
may be found in textbooks like [219, 220, 221]. Recent journal publications, 
for instance in the IEEE Transactions on Signal Processing, show substantial 
research activity, especially in the area of stability of the LMS algorithm 
and its variations. 

8.1 Application of Adaptive Filter 

Although the application fields of adaptive filters are quite broad in nature, 
they can usually be described with one of the following four system configu- 
rations: 

• Interference cancellation 

• Prediction 

• Inverse modeling 

• Identification 

We wish to discuss in the following the basic idea of these systems and 
present some typical successful applications for these classes. Although it may 
not always exactly describe the nature of the specific signals it is common to 
use the following notation for all systems, namely 



x = input to the adaptive filter 
y = output of the adaptive filter 
d = desired response (of the adaptive filter) 
e = d - y = estimation error 

8.1.1 Interference Cancellation 

In these very popular applications of the adaptive filter, the incoming signal 
contains, besides the information-bearing signal, an interference, which 
may, for example, be random white noise or the 50/60 Hz power-line hum. 
Figure 8.1 shows the configuration for this application. The incoming (sensor) 
signal d[n] and the adaptive filter output response y[n] to a reference signal 
x[n] are used to compute the error signal e[n], which is also the system output 
in the interference cancellation configuration. Thus, after convergence, the 
(modified) reference signal y[n], which then approximates the interference, 
is subtracted from the incoming signal. 



[Figure 8.1: block diagram. Recoverable labels: primary signal, input signal 
x[n], adaptive filter, system output e[n].] 

Fig. 8.1. Basic configuration for interference cancellation. 

We will later study a detailed example of the interference cancellation 
of the power-line hum. A second popular application is the adaptive 
cancellation of echoes on telephone systems. Interference cancellation has 
also been used in an array of antennas (called a beamformer) to adaptively 
remove noise interfering from unknown directions. 



8.1.2 Prediction 

In the prediction application the task of the adaptive filter is to provide a 
best prediction (usually in the least mean square sense) of the present value 
of a random signal. This is obviously only possible if the input signal is 
essentially different from white noise. Prediction is illustrated in Fig. 8.2. 
It can be seen that the input d[n] is applied via a delay to the adaptive 
filter input, and is also used to compute the estimation error. 

Predictive coding has been successfully used in image and speech 
signal processing. Instead of coding the signal directly, only the prediction 
error is encoded for transmission or storage. Other applications include the 
modeling of power spectra, data compression, spectrum enhancement, and 
event detection [214]. 

8.1.3 Inverse Modeling 

In the inverse modeling structure the task is to provide an inverse model that 
represents the best fit (usually in the least squares sense) to an unknown 
time-varying plant. A typical communication example would be the task to 
estimate the multipath propagation of the signal to approximate an ideal 
transmission. The system shown in Fig. 8.3 illustrates this configuration. 
The input signal d[n] enters the plant and the output of the unknown plant 
x[n] is the input to the adaptive filter. A delayed version of the input d[n] is 
then used to compute the error signal e[n] and to adjust the filter coefficients 
of the adaptive filter. Thus, after convergence, the adaptive filter transfer 
function approximates the inverse of the transfer function of the unknown 
plant. 

Besides the already-mentioned equalization in communication systems, 
inverse modeling with adaptive filters has been successfully used to improve 
the S/N ratio for additive narrowband noise, for adaptive control systems, in 
speech signal analysis, for deconvolution, and digital filter design [214]. 

Fig. 8.2. Block diagram for prediction. 

Fig. 8.3. Schematic diagram illustrating the inverse system modeling. 

8.1.4 Identification 

In a system identification application the task is that the filter coefficients of 
the adaptive filter represent an unknown plant or filter. The system 
identification is shown in Fig. 8.4 and it can be seen that the time series, 
x[n], is input simultaneously to the adaptive filter and another linear plant 
or filter with unknown transfer function. The output of the unknown plant d[n] 
becomes the output of the entire system. After convergence the adaptive filter 
output y[n] will approximate d[n] in an optimum (usually least mean squares) 
sense. Provided that the order of the adaptive filter matches the order of the 
unknown plant and the input signal x[n] is WSS, the adaptive filter 
coefficients will converge to the same values as the unknown plant. In a 
practical application there will normally be an additive noise present at the 
output of the unknown plant (observation errors) and the filter structure will 
not exactly match that of the unknown plant. This will result in a deviation 
from the perfect performance described. Due to the flexibility of this 
structure and the ability to adjust a number of input parameters 
independently, it is one of the structures often used in the performance 
evaluation of adaptive filters. We will use this configuration to make a 
detailed comparison between LMS and RLS, the two most popular algorithms to 
adjust the filter coefficients of an adaptive filter. 

Such system identification has been used for modeling in biology, to 
model social and business systems, for adaptive control systems, digital filter 
design, and in geophysics [214]. In seismic exploration, such systems 
have been used to generate a layered-earth model to unravel the complexities 
of the earth’s surface [213]. 

8.2 Optimum Estimation Techniques 

Required signal properties. In order to use the adaptive filter algorithms 
presented in the following successfully and to guarantee the convergence and 
stability of the algorithms, it is necessary to make some basic assumptions 
about the nature of our input signals, which, from a probabilistic standpoint, 
can be seen as a vector of random variables. First, the input signal (i.e., 
the random variable vector) should be ergodic, i.e., statistical properties like 
the mean 

$$\eta = E\{x\} = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n]$$

or the variance 

$$\sigma^2 = E\{(x-\eta)^2\} = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} (x[n]-\eta)^2$$

computed using a single input signal should show the same statistical 
properties as the average over an ensemble of such random variables. Secondly, 
the signals need to be wide sense stationary (WSS), i.e., statistical 
measurements like average or variance measured over the ensemble are not 
a function of time, and the autocorrelation function 



$$r[\tau] = E\{x[t_1]\,x[t_2]\} = E\{x[t+\tau]\,x[t]\} = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n]\,x[n+\tau]$$

depends only on the difference $\tau = t_1 - t_2$. We note in particular that 

$$r[0] = E\{x[t]\,x[t]\} = E\{|x[t]|^2\} \tag{8.1}$$

computes the average power of the WSS process. 
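
The time-average definitions above can be tried out numerically. The following is a minimal sketch, assuming a zero-mean, unit-variance white Gaussian test signal (an illustrative choice that is not taken from the text) and a finite record in place of the limit; the function name `time_average_stats` is hypothetical.

```python
import random

def time_average_stats(x, max_lag):
    """Estimate mean, variance, and r[0..max_lag] from a single realization."""
    n_samples = len(x)
    mean = sum(x) / n_samples
    variance = sum((v - mean) ** 2 for v in x) / n_samples
    # r[tau] ~ (1/N) * sum_n x[n] * x[n + tau]; the sum stops at the record end
    r = [sum(x[n] * x[n + tau] for n in range(n_samples - tau)) / n_samples
         for tau in range(max_lag + 1)]
    return mean, variance, r

random.seed(1)
# Assumed test signal: white Gaussian noise, zero mean, unit variance
x = [random.gauss(0.0, 1.0) for _ in range(20000)]
mean, variance, r = time_average_stats(x, max_lag=2)
print("mean:", mean, "variance:", variance, "r:", r)
```

For this signal r[0] is the average power from (8.1) and equals the variance plus the squared mean, while r[1] and r[2] stay close to zero because white noise is uncorrelated from sample to sample.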



Definition of cost function. The definition of the cost function applied to 
the estimator output is a critical parameter in all adaptive filter algorithms. 
We need to “weight” somehow the estimation error 

$$e[n] = d[n] - y[n], \tag{8.2}$$

where d[n] is the random variable to be estimated, and y[n] is the estimate 
computed via the adaptive filter. The most commonly used cost function is 
the least-mean-squares (LMS) function given as 

$$J = E\{e^2[n]\} = E\{(d[n] - y[n])^2\}. \tag{8.3}$$

Fig. 8.4. Basic configuration for identification. 

Fig. 8.5. Three possible error cost functions. 

It should be noted that this is not the only cost function that may be used. 
Alternatives are functions such as the absolute error or the nonlinear 
threshold functions as shown in Fig. 8.5 on the right. The nonlinear threshold 
type may be used if a certain error level is acceptable and, as we will see 
later, can reduce the computational burden of the adaptation algorithm. It may 
be interesting to note that the original adaptive filter algorithms by Widrow 
[216] use such a threshold function for the error. 

On the other hand, the quadratic error function of the LMS method will 
enable us to build a stochastic gradient approach based on the Wiener-Hopf 
relation originally developed in the continuous signal domain. We review the 
Wiener-Hopf estimation in the next subsection, which will directly lead to 
the popular LMS adaptive filter algorithms first proposed by Widrow et al. 
[222, 223]. 



8.2.1 The Optimum Wiener Estimation 

The output of the adaptive FIR filter is computed via the convolution sum 

$$y[n] = \sum_{k=0}^{L-1} f_k\, x[n-k], \tag{8.4}$$

where the filter coefficients $f_k$ have to be adjusted in such a way that the 
defined cost function J is minimum. It is, in general, more convenient to write 
the convolution with vector notations according to 

$$y[n] = \mathbf{x}^T[n]\,\mathbf{f} = \mathbf{f}^T \mathbf{x}[n], \tag{8.5}$$

where $\mathbf{f} = [f_0\ f_1\ \ldots\ f_{L-1}]^T$ and 
$\mathbf{x}[n] = [x[n]\ x[n-1]\ \ldots\ x[n-(L-1)]]^T$ are size 
(L x 1) vectors and T means matrix transposition or the Hermitian 
transposition for complex data. For $\mathbf{A} = [a[k,l]]$ the transposed matrix 
is “mirrored” at the main diagonal, i.e., $\mathbf{A}^T = [a[l,k]]$. Using the 
definition of the error function (8.2) we get 

$$e[n] = d[n] - y[n] = d[n] - \mathbf{f}^T \mathbf{x}[n]. \tag{8.6}$$

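The mean square error function now becomes 

$$\begin{aligned}
J = E\{e^2[n]\} &= E\{(d[n] - y[n])^2\} = E\{(d[n] - \mathbf{f}^T\mathbf{x}[n])^2\} \\
&= E\{(d[n] - \mathbf{f}^T\mathbf{x}[n])(d[n] - \mathbf{x}^T[n]\,\mathbf{f})\} \\
&= E\{d[n]^2 - 2\,d[n]\,\mathbf{f}^T\mathbf{x}[n] + \mathbf{f}^T\mathbf{x}[n]\,\mathbf{x}^T[n]\,\mathbf{f}\}.
\end{aligned} \tag{8.7}$$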

Note that the error is a quadratic function of the filter coefficients that can be 
pictured as a bowl-shaped (upward-opening) hyperparaboloidal surface, a 
function that never goes negative; see Fig. 8.6 for an example with two filter 
coefficients. Adjusting the filter weights to minimize the error involves 
descending along this surface with the objective of getting to the bottom of 
the bowl. Gradient methods are commonly used for this purpose. The choice of a 
mean square type of cost function yields a well-behaved quadratic error 
surface with a single unique minimum. The cost is minimum where the gradient of 
(8.7) with respect to f is zero, i.e., 



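$$\nabla = \frac{\partial J}{\partial \mathbf{f}} = E\{-2\,d[n]\,\mathbf{x}[n] + 2\,\mathbf{x}[n]\,\mathbf{x}^T[n]\,\mathbf{f}_{\mathrm{opt}}\} = \mathbf{0}.$$

Assuming that the filter weight vector f and the signal vector x[n] are 
statistically independent (i.e., uncorrelated), it follows that 

$$E\{d[n]\,\mathbf{x}[n]\} = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}\,\mathbf{f}_{\mathrm{opt}},$$

and the optimal filter coefficient vector can then be computed with 

$$\mathbf{f}_{\mathrm{opt}} = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}^{-1}\, E\{d[n]\,\mathbf{x}[n]\}. \tag{8.8}$$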

The expectation terms are usually defined as follows: 

$$\mathbf{R}_{xx} = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}
= E\begin{bmatrix}
x[n]x[n] & x[n]x[n-1] & \cdots & x[n]x[n-(L-1)] \\
x[n-1]x[n] & x[n-1]x[n-1] & \cdots & \vdots \\
\vdots & & \ddots & \vdots \\
x[n-(L-1)]x[n] & \cdots & \cdots & x[n-(L-1)]x[n-(L-1)]
\end{bmatrix}
= \begin{bmatrix}
r[0] & r[1] & \cdots & r[L-1] \\
r[1] & r[0] & \cdots & r[L-2] \\
\vdots & \vdots & \ddots & \vdots \\
r[L-1] & r[L-2] & \cdots & r[0]
\end{bmatrix}$$

is the (L x L) autocorrelation matrix of the input signal sequence, which has 
the form of a Toeplitz matrix, and 

$$\mathbf{r}_{dx} = E\{d[n]\,\mathbf{x}[n]\}
= E\begin{bmatrix} d[n]x[n] \\ d[n]x[n-1] \\ \vdots \\ d[n]x[n-(L-1)] \end{bmatrix}
= \begin{bmatrix} r_{dx}[0] \\ r_{dx}[1] \\ \vdots \\ r_{dx}[L-1] \end{bmatrix}$$

is the (L x 1) cross-correlation vector between the desired signal and the 
reference signal. With these definitions we can now rewrite (8.8) more 
compactly as 

$$\mathbf{f}_{\mathrm{opt}} = \mathbf{R}_{xx}^{-1}\, \mathbf{r}_{dx}. \tag{8.9}$$

Fig. 8.6. Error cost function for the two-component case. The minimum of the 
cost function is at f0 = 25 and f1 = 43.3. 

This is commonly recognized as the Wiener-Hopf equation [212], which 
yields the optimum LMS solution for the filter coefficient vector 
$\mathbf{f}_{\mathrm{opt}}$. One requirement to have a unique solution for (8.9) is 
that $\mathbf{R}_{xx}^{-1}$ exists, i.e., the autocorrelation matrix must be 
nonsingular, or, put differently, its determinant must be nonzero. Fortunately, 
it can be shown that for WSS signals the $\mathbf{R}_{xx}$ matrix is nonsingular 
[213, p. 41] and the inverse exists. 

Fig. 8.7. Signals used in power-line hum Example 8.1. 



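Using (8.7) the residue error of the optimal estimation becomes: 

$$\begin{aligned}
J_{\mathrm{opt}} &= E\{(d[n] - \mathbf{f}_{\mathrm{opt}}^T \mathbf{x}[n])^2\} \\
&= E\{d[n]^2\} - 2\,\mathbf{f}_{\mathrm{opt}}^T \mathbf{r}_{dx} + \mathbf{f}_{\mathrm{opt}}^T \underbrace{\mathbf{R}_{xx}\mathbf{f}_{\mathrm{opt}}}_{\mathbf{r}_{dx}} \\
J_{\mathrm{opt}} &= r_{dd}[0] - \mathbf{f}_{\mathrm{opt}}^T \mathbf{r}_{dx},
\end{aligned} \tag{8.10}$$

where $r_{dd}[0] = \sigma_d^2$ is the variance of d. 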

We now wish to demonstrate the Wiener-Hopf algorithm with the follow- 
ing example. 



Example 8.1: Two-tap FIR Filter Interference Cancellation 

Suppose we have an observed communication signal that consists of three 
components: the information-bearing signal, which is a Manchester-encoded 
sensor signal m[n] with amplitude B = 10, shown in Fig. 8.7a; an additive 
white Gaussian noise n[n], shown in Fig. 8.7b; and a 60-Hz power-line hum 
interference with amplitude A = 50, shown in Fig. 8.7c. Assuming the 
sampling frequency is 4 times the power-line hum frequency, i.e., 
4 x 60 = 240 Hz, the observed signal can be formulated as follows: 

$$d[n] = A\cos[\pi n/2] + B\,m[n] + \sigma^2 n[n].$$

The reference signal x[n] (shown in Fig. 8.7d), which is applied to the adaptive 
filter input, is given as 

$$x[n] = \cos[\pi n/2 + \phi],$$

where $\phi = \pi/6$ is a constant offset. The two-tap filter then has the 
following output: 

$$y[n] = f_0 \cos\left[\frac{\pi}{2}n + \phi\right] + f_1 \cos\left[\frac{\pi}{2}(n-1) + \phi\right].$$



To solve (8.9) we compute first the autocorrelation for x[n] with delays 0 and 
1: 

$$r_{xx}[0] = E\{(\cos[\pi n/2 + \phi])^2\} = \frac{1}{2}$$

$$r_{xx}[1] = E\{\cos[\pi n/2 + \phi]\,\sin[\pi n/2 + \phi]\} = 0.$$

For the cross-correlation we get 

$$r_{dx}[0] = E\{(A\cos[\pi n/2] + B\,m[n] + \sigma^2 n[n])\cos[\pi n/2 + \phi]\} = \frac{A}{2}\cos(\phi) = 12.5\sqrt{3}$$

$$r_{dx}[1] = E\{(A\cos[\pi n/2] + B\,m[n] + \sigma^2 n[n])\sin[\pi n/2 + \phi]\} = \frac{A}{2}\cos\left(\phi - \frac{\pi}{2}\right) = \frac{50}{4} = 12.5.$$



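As required for the Wiener-Hopf equation (8.9) we can now compute the 
(2 x 2) autocorrelation matrix and the (2 x 1) cross-correlation vector and 
get 

$$\mathbf{f}_{\mathrm{opt}} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{dx}
= \begin{bmatrix} r_{xx}[0] & r_{xx}[1] \\ r_{xx}[1] & r_{xx}[0] \end{bmatrix}^{-1}
\begin{bmatrix} r_{dx}[0] \\ r_{dx}[1] \end{bmatrix}
= \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} \end{bmatrix}^{-1}
\begin{bmatrix} 12.5\sqrt{3} \\ 12.5 \end{bmatrix}
= \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
\begin{bmatrix} 12.5\sqrt{3} \\ 12.5 \end{bmatrix}
= \begin{bmatrix} 25\sqrt{3} \\ 25 \end{bmatrix}
= \begin{bmatrix} 43.3 \\ 25 \end{bmatrix}.$$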



The simulation of these data is shown in Fig. 8.8. It shows (a) the sum of 
the three signals (Manchester-coded 5 bits, power-line hum of 60 Hz, and the 
additive white Gaussian noise) and (b) the system output (i.e., e[n]) with the 
canceled power-line hum. 
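
The numbers of Example 8.1 can be checked with a few lines of code. This is a minimal sketch that replaces the expectations by time averages over whole periods and keeps only the hum component of d[n], assuming (as the example does implicitly) that the data and noise terms are uncorrelated with the reference signal and therefore drop out of the cross-correlation. The helper names `hum`, `ref`, and `avg` are illustrative.

```python
import math

A = 50.0              # hum amplitude from Example 8.1
PHI = math.pi / 6     # phase offset of the reference signal
N = 4000              # averaging length, a multiple of the 4-sample period

def hum(n):
    """Hum component of d[n]; data and noise are assumed to average out."""
    return A * math.cos(math.pi * n / 2)

def ref(n):
    """Reference signal x[n]."""
    return math.cos(math.pi * n / 2 + PHI)

def avg(g):
    """Time average of g(n) over N samples."""
    return sum(g(n) for n in range(1, N + 1)) / N

r_xx0 = avg(lambda n: ref(n) * ref(n))
r_xx1 = avg(lambda n: ref(n) * ref(n - 1))
r_dx0 = avg(lambda n: hum(n) * ref(n))
r_dx1 = avg(lambda n: hum(n) * ref(n - 1))

# Solve the 2 x 2 Wiener-Hopf system (8.9) with Cramer's rule
det = r_xx0 * r_xx0 - r_xx1 * r_xx1
f0 = (r_xx0 * r_dx0 - r_xx1 * r_dx1) / det
f1 = (r_xx0 * r_dx1 - r_xx1 * r_dx0) / det
print("f_opt =", (f0, f1))

# With these weights the filter output reproduces the hum sample by sample
residual = max(abs(hum(n) - (f0 * ref(n) + f1 * ref(n - 1))) for n in range(1, 100))
print("max residual hum:", residual)
```

Cramer's rule is used only because the system is 2 x 2; for longer filters a general linear solver would take its place.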



8.3 The Widrow-Hoff Least Mean Square Algorithms 

There are a couple of reasons why we may wish to avoid a direct 
computation of the Wiener estimation (8.9). First, the generation of the 
autocorrelation matrix $\mathbf{R}_{xx}$ and the cross-correlation vector 
$\mathbf{r}_{dx}$ is already computationally intensive. We need to compute the 
autocorrelation of x and the cross-correlation between d and x and we may, for 
instance, not know how many data samples we need to use in order to have 
sufficient statistics. Secondly, if we have constructed the correlation 
functions we still have to compute the inverse of the autocorrelation matrix, 
$\mathbf{R}_{xx}^{-1}$, which can be very time consuming if the filter order 
gets larger. Even if a procedure is available to invert $\mathbf{R}_{xx}$, the 
precision of the result may not be sufficient because of the many 
computational steps involved, especially with a fixed-point arithmetic 
implementation. 

[Figure 8.8: panels (a) d[n] = data + noise + hum and (b) system output e[n]; 
horizontal axis: time in s.] 

Fig. 8.8. Canceling 60-Hz power-line interference of a Manchester-coded data signal 
using optimum Wiener estimation. 

The Widrow-Hoff least mean square (LMS) adaptive algorithm [222] is a 
practical method for finding a close approximation to (8.9) in real time. The 
algorithm does not require explicit measurement of the correlation functions, 
nor does it involve matrix inversion. Accuracy is limited by statistical sample 
size, since the filter coefficient values are based on the real-time measurements 
of the input signals. 

The LMS algorithm is an implementation of the method of steepest 
descent. According to this method, the next filter coefficient vector f[n + 1] 
is equal to the present filter coefficient vector f[n] plus a change proportional 
to the negative gradient: 

$$\mathbf{f}[n+1] = \mathbf{f}[n] - \frac{\mu}{2}\nabla[n]. \tag{8.11}$$

The parameter $\mu$ is the learning factor or step size that controls stability 
and the rate of convergence of the algorithm. During each iteration the true 
gradient is represented by $\nabla[n]$. 

The LMS algorithm estimates an instantaneous gradient in a crude but 
efficient manner by assuming that the gradient of $J = e[n]^2$ is an estimate of 
the gradient of the mean-square error $E\{e[n]^2\}$. The relationship between the 
true gradient $\nabla[n]$ and the estimated gradient $\hat{\nabla}[n]$ is given by 
the following expressions: 

$$\nabla[n] = \left[\frac{\partial E\{e[n]^2\}}{\partial f_0}, \frac{\partial E\{e[n]^2\}}{\partial f_1}, \ldots, \frac{\partial E\{e[n]^2\}}{\partial f_{L-1}}\right]^T \tag{8.12}$$

$$\hat{\nabla}[n] = \left[\frac{\partial e[n]^2}{\partial f_0}, \frac{\partial e[n]^2}{\partial f_1}, \ldots, \frac{\partial e[n]^2}{\partial f_{L-1}}\right]^T
= 2e[n]\left[\frac{\partial e[n]}{\partial f_0}, \frac{\partial e[n]}{\partial f_1}, \ldots, \frac{\partial e[n]}{\partial f_{L-1}}\right]^T. \tag{8.13}$$

The estimated gradient components are related to the partial derivatives 
of the instantaneous error with respect to the filter coefficients, which can be 
obtained by differentiating (8.6). It follows that 

$$\hat{\nabla}[n] = -2e[n]\frac{\partial y[n]}{\partial \mathbf{f}} = -2e[n]\,\mathbf{x}[n]. \tag{8.14}$$

Using this estimate in place of the true gradient in (8.11) yields: 

$$\mathbf{f}[n+1] = \mathbf{f}[n] - \frac{\mu}{2}\hat{\nabla}[n] = \mathbf{f}[n] + \mu\, e[n]\,\mathbf{x}[n]. \tag{8.15}$$



Let us summarize all necessary steps for the LMS algorithm (see footnote 2) 
in the following. 

Algorithm 8.2: Widrow-Hoff LMS Algorithm 

The Widrow-Hoff LMS algorithm to adjust the L filter coefficients of an 
adaptive filter uses the following steps: 

1) Initialize the (L x 1) vectors f = x = 0 = [0, 0, ..., 0]^T. 

2) Accept a new pair of input samples {x[n], d[n]} and shift x[n] into the 
reference signal vector x[n]. 

3) Compute the output signal of the FIR filter, via 

$$y[n] = \mathbf{f}^T[n]\,\mathbf{x}[n]. \tag{8.16}$$

4) Compute the error function with 

$$e[n] = d[n] - y[n]. \tag{8.17}$$

5) Update the filter coefficients according to 

$$\mathbf{f}[n+1] = \mathbf{f}[n] + \mu\, e[n]\,\mathbf{x}[n]. \tag{8.18}$$

Now continue with step 2. 
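
The five steps of Algorithm 8.2 map directly onto a short loop. The following is a floating-point sketch, not a fixed-point or hardware description; the function name `lms` and the hum-only test signal (the data and noise terms of Example 8.1 are left out so that the result is deterministic) are assumptions made for illustration.

```python
import math

def lms(x, d, num_taps, mu):
    """Run steps 1-5 of the LMS algorithm; return the errors and final weights."""
    f = [0.0] * num_taps        # step 1: coefficient vector f = 0
    x_vec = [0.0] * num_taps    # step 1: reference signal vector x = 0
    errors = []
    for x_n, d_n in zip(x, d):
        x_vec = [x_n] + x_vec[:-1]                              # step 2: shift in x[n]
        y_n = sum(fk * xk for fk, xk in zip(f, x_vec))          # step 3: (8.16)
        e_n = d_n - y_n                                         # step 4: (8.17)
        f = [fk + mu * e_n * xk for fk, xk in zip(f, x_vec)]    # step 5: (8.18)
        errors.append(e_n)
    return errors, f

# Hum-only version of Example 8.1 (assumed simplification: no data, no noise)
phi = math.pi / 6
x = [math.cos(math.pi * n / 2 + phi) for n in range(2000)]
d = [50.0 * math.cos(math.pi * n / 2) for n in range(2000)]
errors, f = lms(x, d, num_taps=2, mu=1 / 16)
print("final coefficients:", f)
print("last error sample:", errors[-1])
```

With this input the coefficients settle at the Wiener solution of Example 8.1; a larger step size speeds up the approach until the stability limit discussed in the text is reached.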

Although the LMS algorithm makes use of gradients of mean-square error 
functions, it does not require squaring, averaging, or differentiation. The 
algorithm is simple and generally easy to implement in software (for MatLab 
code see, for instance, [221, p. 332]; for C code [224]; for PDSP assembler 
code [225]). 

A simulation using the same system configuration as in Example 8.1 
(p. 373) is shown in Fig. 8.9 for different values of the step size $\mu$. 
Adaptation starts after 1 second. System output e[n] is shown in the left 
column and the filter coefficient adaptation on the right. We note that, at a 
rate depending on the value of $\mu$, the filter coefficients approach the 
optimal values f0 = 43.3 and f1 = 25. 

It has been shown that the gradient estimate used in the LMS algorithm 
is unbiased and that the expected value of the weight vector converges to 
the Wiener weight vector (8.9) when the input signals are WSS, which was 
anyway required in order to be able to compute the inverse of the 
autocorrelation matrix $\mathbf{R}_{xx}^{-1}$ for the Wiener estimate. Starting 
with an arbitrary initial filter coefficient vector, the algorithm will 
converge in the mean and will remain stable as long as the learning parameter 
$\mu$ is greater than 0 but less than an upper bound $\mu_{\max}$. Figure 8.10 
shows an alternative form to represent the convergence of the filter 
coefficient adaptation by a projection of the coefficient values in a 
(f0, f1) mapping. The figure also shows the contour lines with equal error. It 
can be seen that the LMS algorithm moves in a zigzag way towards the minimum 
rather than along the true gradient, which would move exactly orthogonal to 
these error contour lines. 

2 Note that in the original presentation of the algorithm [222] the update equation 
f[n + 1] = f[n] + 2$\mu$e[n]x[n] is used because the differentiation of the gradient 
in (8.14) produces a factor 2. The update equation (8.15) follows the notation 
that is used in most of the current textbooks on adaptive filters. 

Fig. 8.9. Simulation of the power-line interference cancellation using the LMS 
algorithm for three different values of the step size $\mu$. (left) System output 
e[n]. (right) Filter coefficients. 

Although the LMS algorithm is considerably simpler than the RLS 
algorithm (we will discuss this later), the convergence properties of the LMS 
algorithm are nonetheless difficult to analyze rigorously. The simplest 
approach to determine an upper bound of $\mu$ makes use of the eigenvalues of 
analysis we may also transform the filter coefficients into independent so-called 
“modes” that are no longer linearly dependent. The number of natural modes 
is equal to the number of degrees of freedom, i.e., the number of independent 
components, and in our case identical to the number of filter coefficients. 
The time constant of the kth mode is related to the kth eigenvalue 
$\lambda[k]$ and the parameter $\mu$ by 

$$\tau[k] = \frac{1}{2\mu\lambda[k]}. \tag{8.23}$$



Hence the longest time constant, $\tau_{\max}$, is associated with the smallest 
eigenvalue, $\lambda_{\min}$, via 

$$\tau_{\max} = \frac{1}{2\mu\lambda_{\min}}. \tag{8.24}$$



Combining (8.22), i.e., $\mu < 2/\lambda_{\max}$, and (8.24) gives 

$$\tau_{\max} > \frac{\lambda_{\max}}{4\lambda_{\min}}, \tag{8.25}$$



which suggests that the larger the eigenvalue ratio (EVR), 
$\lambda_{\max}/\lambda_{\min}$, of the autocorrelation matrix $\mathbf{R}_{xx}$, 
the longer the LMS algorithm will take to converge. Simulation results that 
confirm this finding can be found, for instance, in [220, p. 64] and will be 
discussed in Sect. 8.3.1 (p. 381). 

The results presented so far on the ADF stability can be found in most of the 
original published work by Widrow and in many textbooks. However, these 
conditions do not guarantee a finite variance for the filter coefficient vector, 
neither do they guarantee a finite mean-square error! Hence, as many users 
of the algorithm realized, considerably more stringent conditions are required 
to ensure convergence of the algorithm. In the examples in [221, p. 130], for 
instance, you find the “rule of thumb” that values for $\mu$ smaller by a 
factor of 10 should be used. 

More recent results indicate that the bound from (8.22) must be more 
restrictive. For example, the results presented by Horowitz and Senne [226] 
and derived in a different way by Feuer and Weinstein [227] show that the 
step size (assuming that the elements of the input vector x[n] are statistically 
independent) has to be restricted via the two conditions: 

$$0 < \mu < \frac{1}{\lambda_l}, \qquad l = 0, 1, \ldots, L-1 \tag{8.26}$$

and 

$$\sum_{l=0}^{L-1} \frac{\mu\lambda_l}{1 - \mu\lambda_l} < 2, \tag{8.27}$$



to ensure convergence. These conditions cannot be solved analytically, but 
it can be shown that they are closely bounded by the following condition: 

$$0 < \mu < \frac{2}{3 \times \mathrm{trace}(\mathbf{R}_{xx})} = \frac{2}{3 \times L \times r_{xx}[0]}. \tag{8.28}$$

Fig. 8.11. Simulation of the power-line interference cancellation using the 
maximum step size values for the LMS algorithm. (left) System output e[n]. 
(right) Filter coefficients. 



The upper bound of (8.28) has a distinct practical advantage. The trace of 
$\mathbf{R}_{xx}$ is, by definition (see (8.21), p. 378), the total average input 
signal power of the reference signal, which can easily be estimated from the 
reference signal x[n]. 

Example 8.3: Bounds on Step Size 

From the analysis in (8.22) we see that we first need to compute the 
eigenvalues of the $\mathbf{R}_{xx}$ matrix, i.e., 

$$0 = \det(\lambda\mathbf{I} - \mathbf{R}_{xx}) = \det\begin{bmatrix} \lambda - r_{xx}[0] & -r_{xx}[1] \\ -r_{xx}[1] & \lambda - r_{xx}[0] \end{bmatrix} \tag{8.29}$$

$$= \det\begin{bmatrix} \lambda - 0.5 & 0 \\ 0 & \lambda - 0.5 \end{bmatrix} = (0.5 - \lambda)^2 \tag{8.30}$$

$$\lambda[1,2] = 0.5. \tag{8.31}$$

Using (8.22) gives 

$$\mu_{\max} = \frac{2}{\lambda_{\max}} = 4. \tag{8.32}$$

Using the more restrictive bound from (8.28) yields 

$$\mu_{\max} = \frac{2}{L \times 3 \times r_{xx}[0]} = \frac{2}{3 \times 2 \times 0.5} = \frac{2}{3}. \tag{8.33}$$

The simulation results in Fig. 8.11 indicate that in fact $\mu = 4$ does not show 
convergence, while $\mu = 2/3$ converges. 

Fig. 8.12. System identification configuration for LMS learning curves. 
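
The bounds of Example 8.3 are easy to reproduce. The sketch below assumes the two-tap case, where the symmetric Toeplitz matrix has the closed-form eigenvalues r[0] - |r[1]| and r[0] + |r[1]|; the function names are illustrative, and the eigenvalue bound is written as 2/lambda_max, the form in which (8.22) is used in the example.

```python
def eig_sym_toeplitz_2x2(r0, r1):
    """Eigenvalues of [[r0, r1], [r1, r0]] in ascending order."""
    return r0 - abs(r1), r0 + abs(r1)

def time_constant(mu, lam):
    """Time constant of one mode according to (8.23)."""
    return 1.0 / (2.0 * mu * lam)

L = 2
r_xx0, r_xx1 = 0.5, 0.0                       # values from Example 8.1
lam_min, lam_max = eig_sym_toeplitz_2x2(r_xx0, r_xx1)

mu_max_eig = 2.0 / lam_max                    # eigenvalue bound used in (8.32)
mu_max_trace = 2.0 / (3.0 * L * r_xx0)        # trace-based bound (8.28)
tau_max = time_constant(mu_max_trace, lam_min)

print("eigenvalues:", lam_min, lam_max)
print("mu_max from eigenvalues:", mu_max_eig)
print("mu_max from trace:", mu_max_trace)
print("tau_max at the trace bound:", tau_max)
```

The trace-based bound needs only the input power and the filter length, which is why it is the more convenient of the two in practice.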




We also note from the simulation shown in Fig. 8.11 that with 
$\mu_{\max} = 2/3$ the convergence is much faster, but the coefficients 
“ripple around” considerably. Much smaller values for $\mu$ are necessary to 
have a smooth approach of the filter coefficients to the optimal values and to 
stay there. 

The condition found by Horowitz and Senne [226] and Feuer and Weinstein 
[227] made the assumption that all inputs x[n] are statistically independent. 
This assumption is true if the input data come, for instance, from an 
antenna array of L independent sensors; however, for ADFs with the tapped 
delay structure, it has been shown, for instance, by Butterweck [213], that 
for a long filter the stability bound can be relaxed to 

$$0 < \mu < \frac{2}{L \times r_{xx}[0]}, \tag{8.34}$$

i.e., compared with (8.28) the upper bound can be relaxed by a factor of 3 in 
the denominator. But the condition (8.34) only applies for a long filter and 
it may therefore be safer to use (8.28). 



8.3.1 Learning Curves 

The learning curve, i.e., the error function J displayed over the number of 
iterations, is an important measurement instrument when comparing the 
performance of different algorithms and system configurations. We wish in the 
following to study the LMS algorithm regarding the eigenvalue ratio 
$\lambda_{\max}/\lambda_{\min}$ and the sensitivity to the signal-to-noise (S/N) 
ratio in the system to be identified. 

Fig. 8.13. Eigenvalue ratio for a three-tap filter system for different system 
sizes L. 



A typical performance measurement of adaptive algorithms using a 
system-identification problem is displayed in Fig. 8.12. The adaptive filter 
has a length of L = 16, the same length as the “unknown” system, whose 
coefficients have to be learned. The additive noise level behind the “unknown” 
system has been set to two different levels: -10 dB for a high-noise 
environment and -48 dB for a low-noise environment equivalent to an 8-bit 
quantization. 

For the LMS algorithm the eigenvalue ratio (EVR) is the critical parameter 
that determines the convergence speed, see (8.25), p. 379. In order to 
generate a different eigenvalue ratio we use a white Gaussian noise source 
with $\sigma^2 = 1$ that is shaped by a digital filter. We may, for instance, 
use a first-order IIR filter that generates a first-order Markov process, see 
Exercise 8.10 (p. 420). We may alternatively filter the white noise by a 
three-tap symmetrical FIR filter whose coefficients are $c^T = [a, b, a]$. The 
FIR filter has the advantage that we can easily normalize the power. The 
coefficients should be normalized, $\sum_k c_k^2 = 1$, in such a way that input 
and output sequences have the same power. This requires that 

$$1 = a^2 + b^2 + a^2 \quad\text{or}\quad a = \sqrt{0.5 \times (1 - b^2)}. \tag{8.35}$$

With this filter it is possible to generate different eigenvalue ratios 
$\lambda_{\max}/\lambda_{\min}$ as shown in Fig. 8.13 for different system sizes 
L = 2, 4, 8, and 16. We can now use Table 8.1 to get power-of-ten EVRs for the 
system of length L = 16. 

Table 8.1. Four different noise-shaping FIR filters to generate power-of-ten 
eigenvalue ratios for L = 16. 

No.   Impulse response                                EVR 
1     0 + 1z^-1 + 0.0z^-2                             1 
2     0.247665 + 0.936656z^-1 + 0.247665z^-2          10 
3     0.577582 + 0.576887z^-1 + 0.577582z^-2          100 
4     0.432663 + 0.790952z^-1 + 0.432663z^-2          1000 
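
The power normalization of the filters in Table 8.1 can be checked directly from the coefficients. The sketch below computes the autocorrelation of unit-variance white noise shaped by c = [a, b, a] and, as a small illustration only, the eigenvalue ratio of the resulting 2 x 2 autocorrelation matrix; the EVR values in the table refer to L = 16 and are not reproduced here. All function names are illustrative.

```python
import math

# (a, b) pairs of the symmetric filters c = [a, b, a] from Table 8.1
filters = {
    1: (0.0, 1.0),
    2: (0.247665, 0.936656),
    3: (0.577582, 0.576887),
    4: (0.432663, 0.790952),
}

def shaped_noise_autocorrelation(a, b, sigma2=1.0):
    """Return r[0], r[1], r[2] of white noise filtered by c = [a, b, a]."""
    r0 = sigma2 * (a * a + b * b + a * a)   # sum of c_k^2
    r1 = sigma2 * (a * b + b * a)           # sum of c_k * c_(k+1)
    r2 = sigma2 * (a * a)                   # c_0 * c_2
    return r0, r1, r2

def evr_two_tap(r0, r1):
    """Eigenvalue ratio of [[r0, r1], [r1, r0]] (L = 2 only)."""
    return (r0 + abs(r1)) / (r0 - abs(r1))

results = {}
for number, (a, b) in filters.items():
    r0, r1, r2 = shaped_noise_autocorrelation(a, b)
    results[number] = (r0, r1, r2, evr_two_tap(r0, r1))
    print(number, "output power:", r0, "EVR for L = 2:", results[number][3])
```

For every row the output power r[0] comes out as one to within the rounding of the tabulated coefficients, which is the normalization stated in (8.35).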



For a white Gaussian source the $\mathbf{R}_{xx}$ matrix is a diagonal matrix 
$\sigma^2\mathbf{I}$ and the eigenvalues are therefore all one, i.e., 
$\lambda_l = 1$; l = 0, 1, ..., L - 1. The other EVRs can be verified with 
MatLab, see Exercise 8.9 (p. 420). The impulse response of the unknown system 
$g_k$ is an odd filter with coefficients 1, -2, 3, -4, ..., -3, 2, -1 as shown 
in Fig. 8.14a. The step size for the LMS algorithm has been determined with 

$$\mu_{\max} = \frac{2}{3 \times L \times E\{x^2\}} = \frac{1}{24}. \tag{8.36}$$

In order to guarantee perfect stability the step size has been chosen to be 
$\mu = \mu_{\max}/2 = 1/48$. The learning curve, or coefficient error, is 
computed via the normalized error function 

$$J[n] = 20\log_{10}\left(\frac{\sum_{k=0}^{15}(g_k - f_k[n])^2}{\sum_{k=0}^{15} g_k^2}\right). \tag{8.37}$$



The coefficient adaptation for a single adaptation run with EVR = 1 is shown 
in Fig. 8.14b. It can be seen that after 200 iterations the adaptive filter has 
learned the coefficients of the unknown system without an error. From the 
learning curves (average over 50 adaptation cycles) shown in Fig. 8.14c and 
d it can be seen that the LMS algorithm is very sensitive to the EVR. Many 
iterations are necessary in the case of a high EVR. Unfortunately, many 
real-world signals have a high EVR. Speech signals, for instance, may have an 
EVR of 1874 [228]. On the other hand, we see from Fig. 8.14c that the LMS 
algorithm still adapts well in a high-noise environment. 
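
A single learning curve for the EVR = 1 case can be sketched as follows. The code assumes white Gaussian input with unit variance, no additive plant noise by default, and one adaptation run rather than the average over 50 runs used for the figures. The full 16-tap coefficient list is an assumed reading of the elided sequence 1, -2, 3, -4, ..., -3, 2, -1, and the function names are illustrative.

```python
import math
import random

def coefficient_error_db(g, f):
    """Normalized coefficient error in dB, following (8.37)."""
    num = sum((gk - fk) ** 2 for gk, fk in zip(g, f))
    den = sum(gk ** 2 for gk in g)
    return 20.0 * math.log10(max(num / den, 1e-300))  # floor avoids log10(0)

def lms_learning_curve(g, mu, num_iterations, noise_std=0.0, seed=0):
    """Identify the plant g with LMS and record J[n] for every iteration."""
    rng = random.Random(seed)
    f = [0.0] * len(g)
    x_vec = [0.0] * len(g)
    curve = []
    for _ in range(num_iterations):
        x_vec = [rng.gauss(0.0, 1.0)] + x_vec[:-1]            # white input, EVR = 1
        d = sum(gk * xk for gk, xk in zip(g, x_vec))          # plant output
        d += rng.gauss(0.0, noise_std)                        # optional plant noise
        e = d - sum(fk * xk for fk, xk in zip(f, x_vec))      # error (8.17)
        f = [fk + mu * e * xk for fk, xk in zip(f, x_vec)]    # update (8.18)
        curve.append(coefficient_error_db(g, f))
    return curve

# Assumed expansion of the odd impulse response for L = 16
g = [1, -2, 3, -4, 5, -6, 7, -8, 8, -7, 6, -5, 4, -3, 2, -1]
curve = lms_learning_curve(g, mu=1 / 48, num_iterations=1000)
print("J after 1, 200, and 1000 iterations:", curve[0], curve[199], curve[-1])
```

Passing a nonzero `noise_std` gives a curve that levels off at a noise floor, and feeding the input through one of the shaping filters of Table 8.1 before it enters the loop would give the slower curves for higher EVR.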



8.3.2 Normalized LMS (NLMS) 

The LMS algorithm discussed so far uses a constant step size $\mu$ proportional 
to the stability bound $\mu_{\max} = 2/(L \times r_{xx}[0])$. Obviously this 
requires knowledge of the signal statistic, i.e., $r_{xx}[0]$, and this 
statistic must not change over time. It is, however, possible that this 
statistic changes over time, and we wish to adjust $\mu$ accordingly. This can 
be accomplished by computing a temporary estimate for the signal power via 

$$\hat{r}_{xx}[0] = \frac{1}{L}\,\mathbf{x}^T[n]\,\mathbf{x}[n], \tag{8.38}$$

and the “normalized” $\mu$ is given by 

$$\mu_{\max}[n] = \frac{2}{\mathbf{x}^T[n]\,\mathbf{x}[n]}. \tag{8.39}$$

If we are concerned that the denominator can temporarily become very small 
and $\mu$ too large, we may add a small constant $\epsilon$ to 
$\mathbf{x}^T[n]\mathbf{x}[n]$, which yields 

$$\mu_{\max}[n] = \frac{2}{\epsilon + \mathbf{x}^T[n]\,\mathbf{x}[n]}. \tag{8.40}$$

To be on the safe side, we would not choose $\mu_{\max}[n]$. Instead we would use 
a somewhat smaller value, like $0.5 \times \mu_{\max}[n]$. The following example 
should demonstrate the normalized LMS algorithm. 

Fig. 8.14. Learning curves for the LMS algorithm using the system identification 
configuration shown in Fig. 8.12. (a) Impulse response $g_k$ of the “unknown” 
system. (b) Coefficient learning over time. (c) Average over 50 learning curves 
for large system noise. (d) Average over 50 learning curves for small system 
noise. 

[Figure 8.15: panels (a) to (d); panel (b) is titled “NLMS” and panel (d) is 
titled “EVR = 668.13250”.] 

Fig. 8.15. Learning curves for the normalized LMS algorithm using the system 
identification configuration shown in Fig. 8.12. (a) The reference signal input x[n] 
to the adaptive filter and the “unknown” system. (b) Coefficient learning over time 
for the normalized LMS. (c) Step size $\mu$ used for LMS and NLMS. (d) Average 
over 50 learning curves. 



Example 8.4: Normalized LMS 

Suppose we have again the system identification configuration from Fig. 8.12 
(p. 381), only this time the input signal x[n] to the adaptive filter and the 
“unknown system” is the noisy pulse-amplitude-modulated (PAM) signal shown 
in Fig. 8.15a. For the conventional LMS we compute first $r_{xx}[0]$, and 
calculate $\mu_{\max} = 0.0118$. The step size for the normalized LMS algorithm 
is adjusted depending on the momentary power $\sum x[n]^2$ of the reference 
signal. For the computation of $\mu_{\mathrm{NLMS}}[n]$ shown in Fig. 8.15c it can 
be seen that at times when the absolute value of the reference signal is large 
the step size is reduced and for small absolute values of the reference signal, 
a larger step size is used. The adaptation of the coefficients displayed over 
time in Fig. 8.15b reflects this issue. Larger learning steps can be seen at 
those times when $\mu_{\mathrm{NLMS}}[n]$ is larger. An average over 50 
adaptations is shown in the learning curves in Fig. 8.15d. Although the EVR of 
the noisy PAM is larger than 600, it can be seen that the normalized LMS has a 
positive effect on the convergence behavior of the algorithm. 
The power estimation using (8.38) is a precise power snapshot of the
current data vector $x[n]$. It may, however, be desirable to have a longer memory
in the power computation to avoid a temporarily small power value and hence a
large $\mu$ value. This can be accomplished using a recursive update of the previous
estimations of the power, with

$$P[n] = \beta P[n-1] + (1-\beta)\,|x[n]|^2, \qquad (8.41)$$

with $\beta$ less than but close to 1. For a nonstationary signal such as the one
shown in Fig. 8.15 the parameter $\beta$ must be chosen carefully. If
we select $\beta$ too small the NLMS will more and more have the performance
of the original LMS algorithm, see Exercise 8.14 (p. 421).
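The recursive power estimate (8.41) can be tried out with a small system identification loop. The following Python sketch is only an illustration under stated assumptions: the function name, the default values of `beta`, `mu_scale`, and `delta`, and the use of $L \times P[n]$ as a stand-in for $x^T[n]x[n]$ are choices made here, not taken from the text. The scale factor is deliberately conservative because a recursive estimate can lag behind the true momentary power.

```python
import random

def nlms_identify(g, x, beta=0.9, mu_scale=0.1, delta=1e-6):
    """Identify the FIR system g from input x with a power-normalized LMS.

    The power P[n] is tracked recursively as in (8.41); L*P[n] replaces the
    snapshot x^T[n]x[n] in the normalized step size (an assumption of this sketch).
    """
    L = len(g)
    f = [0.0] * L            # adaptive coefficients
    taps = [0.0] * L         # tapped delay line, newest sample first
    power = 0.0              # recursive power estimate P[n]
    for sample in x:
        taps = [sample] + taps[:-1]
        d = sum(gk * xk for gk, xk in zip(g, taps))    # "unknown" system output
        y = sum(fk * xk for fk, xk in zip(f, taps))    # adaptive filter output
        e = d - y
        power = beta * power + (1.0 - beta) * sample * sample
        mu = mu_scale * 2.0 / (delta + L * power)      # scaled-down mu_max[n]
        f = [fk + mu * e * xk for fk, xk in zip(f, taps)]
    return f

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
print(nlms_identify([0.5, -0.3], x))
```

In this noise-free setting the coefficients approach the “unknown” impulse response; changing `beta` shows how the memory of the power estimate affects the adaptation.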



8.4 Transform Domain LMS Algorithms 

LMS algorithms that solve the filter coefficient adjustment in a transform
domain have been proposed for two reasons. The goal of the fast convolution
techniques [229] is to lower the computational effort by using a block update
and by computing the adaptive filter output and the filter coefficient
adjustment in the transform domain with the help of a
fast cyclic convolution algorithm. The second method that uses transform domain
techniques has the main goal of improving the adaptation rate of the LMS
algorithm, because it is possible to find transforms that allow a “decoupling”
of the modes of the adaptive filter [228, 230].

8.4.1 Fast-convolution Techniques

Fast cyclic convolution using transforms like FFTs or NTTs can be applied
to FIR filters. For the adaptive filter this leads to a block-oriented processing
of the data. Although we may use any block size, the block size is usually
chosen to be twice the adaptive filter length so that the time
delay in the coefficient update does not become too large. This is also most often
a good choice with respect to computational effort. In the first step a block of $2L$
input values $x[n]$ is convolved via transform with the filter coefficients $f_l$,
which produces $L$ new filter output values $y[n]$. These results are then used
to compute $L$ error signals $e[n]$. The filter coefficient update is then also done
in the transform domain, using the already transformed input sequence $x[n]$.
Let us go through these block processing steps using an $L = 3$ example. We
compute the three filter output signals in one block:

$$y[n] = f_0[n]x[n] + f_1[n]x[n-1] + f_2[n]x[n-2]$$
$$y[n+1] = f_0[n]x[n+1] + f_1[n]x[n] + f_2[n]x[n-1]$$
$$y[n+2] = f_0[n]x[n+2] + f_1[n]x[n+1] + f_2[n]x[n].$$

These can be interpreted as a cyclic convolution of

$$\{f_0, f_1, f_2, 0, 0, 0\} \circledast \{x[n+2], x[n+1], x[n], x[n-1], x[n-2], 0\}.$$



The error signals then follow with

$$e[n] = d[n] - y[n], \quad e[n+1] = d[n+1] - y[n+1], \quad e[n+2] = d[n+2] - y[n+2].$$

The block processing for the filter gradient $\nabla$ can now be written as

$$\nabla[n] = e[n]x[n], \quad \nabla[n+1] = e[n+1]x[n+1], \quad \nabla[n+2] = e[n+2]x[n+2].$$

The update for each individual coefficient is then computed with

$$\nabla_0 = e[n]x[n] + e[n+1]x[n+1] + e[n+2]x[n+2]$$
$$\nabla_1 = e[n]x[n-1] + e[n+1]x[n] + e[n+2]x[n+1]$$
$$\nabla_2 = e[n]x[n-2] + e[n+1]x[n-1] + e[n+2]x[n].$$

We again see that this is a cyclic convolution, only this time the input
sequence $x[n]$ appears in reverse order:

$$\{0, 0, 0, e[n], e[n+1], e[n+2]\} \circledast \{0, x[n-2], x[n-1], x[n], x[n+1], x[n+2]\}.$$

In the Fourier domain the reverse order in time means that we need to
compute the conjugate transform of $X$. The coefficient update then becomes

$$f[n+L] = f[n] + \frac{\mu_B}{L}\nabla[n]. \qquad (8.42)$$

Figure 8.16 shows all the necessary steps, when using the FFT for the 
fast convolution. 
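The block processing steps can also be written out directly in the time domain, which is a convenient reference when checking a transform-based implementation. The Python sketch below is only an illustration: the function name, the layout of `x_hist`, and the scaling of the gradient by `mu_b / L` are assumptions made here. It computes the $L$ outputs, the $L$ errors, and the summed gradient per coefficient, with $\nabla_k = \sum_i e[n+i]\,x[n+i-k]$.

```python
def block_lms_step(f, x_hist, d_block, mu_b):
    """One block update of an L-tap adaptive filter, computed in the time domain.

    x_hist  : x[n-L+1] ... x[n+L-1], oldest first (2L-1 samples)
    d_block : d[n] ... d[n+L-1]
    The coefficients are held constant during the block.
    """
    L = len(f)
    off = L - 1  # x[n+j] is stored at x_hist[j + off]
    y = [sum(f[k] * x_hist[i - k + off] for k in range(L)) for i in range(L)]
    e = [d_block[i] - y[i] for i in range(L)]
    # Gradient for coefficient k: sum over the block of e[n+i] * x[n+i-k]
    grad = [sum(e[i] * x_hist[i - k + off] for i in range(L)) for k in range(L)]
    f_new = [f[k] + (mu_b / L) * grad[k] for k in range(L)]
    return f_new, y, e, grad

# L = 3 example with x[n-2..n+2] = 1..5 and all-zero start coefficients
f_new, y, e, grad = block_lms_step([0.0, 0.0, 0.0], [1, 2, 3, 4, 5], [1, 1, 1], 0.3)
print(grad)   # [12, 9, 6]
```

With zero start coefficients all three errors equal the desired values, so the gradient entries are simply sums of three consecutive input samples, shifted by one sample per coefficient.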

From the stability standpoint the block delay in the coefficient update is not
uncritical. Feuer [231] has shown that the step size has to be reduced to

$$0 < \mu_B < \frac{2B}{(B+2) \times \mathrm{trace}(R_{xx})} = \frac{2}{(1+2/B)\,L \times r_{xx}[0]} \qquad (8.43)$$

for a block update of $B$ steps each. If we compare this result with the result
for $\mu_{\max}$ from (8.28), p. 379, we note that the values are very similar. Only
for large block sizes $B \gg L$ will the change in $\mu_B$ have considerable impact.
The bound reduces to (8.28) for a block size of $B = 1$. However, the time constant
is measured in blocks of $L$ data and it follows that the largest time constant
for the BLMS algorithm is $L$ times larger than the largest time constant
associated with the LMS algorithm.
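To get a feeling for how the bound (8.43) depends on the block size, it can be evaluated numerically. The helper below is a minimal sketch; the function name and the example values ($L = 16$, $r_{xx}[0] = 1$) are illustrative assumptions.

```python
def block_step_bound(B, L, r0):
    """Upper bound on mu_B from (8.43), using trace(R_xx) = L * r_xx[0]."""
    return 2.0 * B / ((B + 2) * L * r0)

for B in (1, 16, 32, 1024):
    print(B, block_step_bound(B, 16, 1.0))
```

For $B = 1$ the result equals $2/(3 L\, r_{xx}[0])$, and for growing $B$ it approaches $2/(L\, r_{xx}[0])$ from below.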



8.4.2 Using Orthogonal Transforms 

We have seen in Sect. 8.3.1 (p. 381) that the LMS algorithm is highly sensitive
to the eigenvalue ratio (EVR). Unfortunately, many real-world signals have
high EVRs. Speech signals, for instance, may have an EVR of 1874 [228]. But it is
also well known that the transform domain algorithms allow a “decoupling” of
the modes of the signals. The Karhunen-Loeve transform (KLT) is the optimal
method in this respect, but unfortunately not a real-time option, see Exercise
8.11 (p. 421). Discrete cosine transforms (DCT) and fast Fourier transforms
(FFT), followed by other orthogonal transforms like Walsh, Hadamard, or
Haar, are the next best choices in terms of convergence speed, see Exercise
8.13 (p. 421) [232, 233].

Fig. 8.16. Fast transform domain filtering method using the FFT.

Let us now try to use this concept to improve the learning
rate of the identification experiment presented in Sect. 8.3.1 (p. 381), where
the adaptive filter has to “learn” the impulse response of an unknown 16-tap
FIR filter, as shown in Fig. 8.12 (p. 381). In order to apply the transform
techniques and still monitor the learning progress we need to compute, in
addition to the LMS Algorithm 8.2 (p. 376), the DCT of the incoming reference
signal $x[n]$ as well as the IDCT of the coefficient vector $f_n$. In a practical
application we do not need to compute the IDCT in every step; it is only necessary to
compute it once after we reach convergence. The following MatLab code
demonstrates the transform domain DCT-LMS algorithm.

for k = L:Iterations                 % adapt over full length
  x = [xin; x(1:L-1)];               % get new sample
  din = g'*x + n(k);                 % "unknown" filter output + AWGN
  z = dct(x);                        % LxL orthogonal transform
  y = f'*z;                          % transformed filter output
  err = din - y;                     % error: primary - reference
  f = f + err*mu.*z;                 % update weight vector
  fi = idct(f);                      % filter in original domain
  J(k-L+1) = J(k-L+1) + sum((fi-g).^2);  % Learning curve
end

Fig. 8.17. Optimal step size for the DCT-LMS transform domain algorithm using
the system identification configuration shown in Fig. 8.12 (p. 381) for four
different eigenvalue ratios. (a) Eigenvalue ratio of 1. (b) Eigenvalue ratio of 10.
(c) Eigenvalue ratio of 100. (d) Eigenvalue ratio of 1000.



The effect of a transform $T$ on the eigenvalue spread can be computed
via

$$R_{zz} = T R_{xx} T^H, \qquad (8.44)$$

where the superscript $H$ denotes the transpose conjugate.

The only thing we have not considered so far is that the $L$ “modes” or
frequencies of the transformed input signal $z[n]$ are now more or less statistically
independent input vectors, and the step size $\mu$ in the original domain may no
longer be appropriate to guarantee stability or allow fast convergence. In
fact, the simulations by Lee and Un [233] show that if no power normalization
is used in the transform domain, the convergence does not improve
compared with the time-domain LMS algorithm. It is therefore reasonable
to compute different step sizes for these $L$ spectral components according
to the stability bound (8.28), p. 379, just using the power of the transform
components:

$$\mu_{\max}[k] = \frac{2}{3 \times L \times r_{zz,k}[0]} \quad \text{for } k = 0, 1, \ldots, L-1.$$

The additional effort is now the computation of the power normalization
of all $L$ spectral components. The MatLab code above already includes a
componentwise update via mu.*z, where .* stands for componentwise
multiplication.
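The per-component step sizes can be estimated from a set of input vectors as sketched below. This is an illustration only: the function names, the use of an orthonormal DCT-II matrix, and the small constant `eps` that guards against a zero-power component are assumptions made here.

```python
import math

def dct_matrix(L):
    """Orthonormal DCT-II matrix T (L x L), so that z = T x."""
    T = []
    for k in range(L):
        scale = math.sqrt(1.0 / L) if k == 0 else math.sqrt(2.0 / L)
        T.append([scale * math.cos(math.pi * (2 * n + 1) * k / (2 * L))
                  for n in range(L)])
    return T

def per_component_step_sizes(frames, L, eps=1e-12):
    """Return mu_max[k] = 2 / (3 * L * r_zz,k[0]) with the power r_zz,k[0]
    estimated as the mean of z[k]^2 over all input vectors in `frames`."""
    T = dct_matrix(L)
    power = [0.0] * L
    for x in frames:
        z = [sum(T[k][n] * x[n] for n in range(L)) for k in range(L)]
        power = [p + zk * zk for p, zk in zip(power, z)]
    power = [p / len(frames) for p in power]
    return [2.0 / (3.0 * L * (p + eps)) for p in power]

# The L unit vectors excite every transform component with the same power 1/L.
L = 4
unit_frames = [[1.0 if n == m else 0.0 for n in range(L)] for m in range(L)]
print(per_component_step_sizes(unit_frames, L))
```

With equal power in all components, all step sizes come out equal, which corresponds to the flat step-size profile expected for a white input.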

The adjustment in $\mu$ is somewhat similar to the normalized LMS algorithm
we have discussed before. We may therefore directly use a power
normalization update similar to (8.39), p. 384, for each frequency component.
The effect of power normalization and transform $T$ on the eigenvalue spread
can be computed via

$$R_{zz} = \Lambda^{-1} T R_{xx} T^H \Lambda^{-1}, \qquad (8.45)$$

where $\Lambda^{-1}$ is a diagonal matrix that normalizes $R_{zz}$ in such a way that the
diagonal elements all become 1, see [232].

Figure 8.17 shows the computed step sizes for four different eigenvalue
ratios of the $L = 16$ FIR filter. For a pure Gaussian input all spectral
components should have equal power and the step size is almost the same for all
components, as can be seen from Fig. 8.17a. The other filters shape the noise in such a way that the
power of some spectral components is increased (decreased) and the step size
has to be set to a lower (higher) value.

From Fig. 8.18 the positive effect on the performance of the DCT-LMS
transform-domain approach can be seen. The learning converges, even for
very high eigenvalue ratios like 1000. Only the error floor and consistency of
the error at −48 dB is not reached as well for high EVRs as for the lower
EVRs.

One factor that must be considered in choosing the transform for real-time
applications is the computational complexity. In this respect, real
transforms like the DCT or DST are superior to complex transforms
like the FFT, and transforms with fast algorithms are better than those
without. Integer transforms like Haar or Hadamard, which do not need
multiplications at all, are desirable [232]. Lastly, we also need to take into account
that the RLS (discussed later) is another alternative, which has, in general, a
higher complexity than the LMS algorithm, but may be more efficient than
a transform domain filter approach and may also yield as fast a convergence as
the KLT-based LMS algorithm.

Fig. 8.18. Learning curves for the DCT transform domain LMS algorithm using
the system identification configuration shown in Fig. 8.12 (p. 381) for an average
of 50 cycles using four different eigenvalue ratios.

8.5 Implementation of the LMS Algorithm 

We now wish to look at the task of implementing the LMS algorithm with
FPGAs. Before we can proceed with an HDL design, however, we need to
ensure that quantization effects are tolerable. Later in this section we will
then try to improve the throughput by using pipelining, and we need to
ensure that the ADF is then still stable.

8.5.1 Quantization Effects 

Before we can start to implement the LMS algorithm in hardware we need
to ensure that the parameters and data are well in the “green” range. This
can be done if we change the software simulation from full precision to the
desired integer precision. Figure 8.19 shows the simulation for 8-bit integer
data and $\mu = 1/4$, $1/8$, and $1/16$. Note that we cannot choose $\mu$ too small,
otherwise we will no longer get convergence because of the large scaling of the
gradient $e[n]x[n]$ with $\mu$ in the coefficient update equation (8.18), p. 376.
The smaller the step size $\mu$, the more problems the algorithm has converging
to the optimal values $f_0 = 43.3$ and $f_1 = 25$. This is in some sense contrary
to the upper bound on $\mu$ given through the stability requirement
of the algorithm. It can therefore be necessary to add fractional bits to the
system to reconcile these two contradictory requirements.
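The loss of convergence for small step sizes can be made concrete with a few lines of integer arithmetic. The Python sketch below is an illustration under stated assumptions: the data $x$ is taken to be an 8-bit fraction scaled by 128, the step size is a power of two $\mu = 2^{-\text{shift}}$, and the function names are chosen here. It reports the smallest error that still changes a coefficient at all.

```python
def integer_increment(e, x, shift):
    """Coefficient increment mu*e*x in integer arithmetic with mu = 2**-shift.

    x is assumed to be a fraction scaled by 128, so the product is rescaled
    by 2**7 in addition to the step-size shift. Python's >> floors the result.
    """
    return (e * x) >> (7 + shift)

def smallest_effective_error(x, shift):
    """Smallest positive error e for which the coefficient still changes."""
    if x <= 0:
        raise ValueError("x must be positive for this search")
    e = 1
    while integer_increment(e, x, shift) == 0:
        e += 1
    return e

for shift in (2, 3, 4):   # mu = 1/4, 1/8, 1/16
    print(shift, smallest_effective_error(127, shift))
```

Under these assumptions, each halving of $\mu$ roughly doubles the error magnitude below which the adaptation stalls, which is why additional fractional bits can become necessary.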
Fig. 8.19. Simulation of the power-line interference cancellation using the LMS
algorithm for integer data. (left) System output $e[n]$. (right) Filter coefficients.



8.5.2 FPGA Design of the LMS Algorithm 

A possible implementation of the algorithm represented as a signal flow graph
is shown in Fig. 8.20. From a hardware implementation standpoint we note
that we need one scaling for $\mu$ and $2L$ general multipliers. The effort is
therefore more than twice the effort of the programmable FIR filter discussed
in Chap. 3, Example 3.1 (p. 111).

We wish to study in the following the FPLD implementation of the LMS
algorithm.

Example 8.5: Two-tap Adaptive LMS FIR Filter 

The VHDL design³ for a filter with two coefficients $f_0$ and $f_1$ with a step
size of $\mu = 1/4$ is shown in the following listing.

-- This is a generic LMS FIR filter generator
-- It uses W1 bit data/coefficients bits
LIBRARY lpm;                      -- Using predefined packages
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;

ENTITY fir_lms IS                            ------> Interface
  GENERIC (W1 : INTEGER := 8;   -- Input bit width
           W2 : INTEGER := 16;  -- Multiplier bit width 2*W1
           L  : INTEGER := 2    -- Filter length
           );
  PORT ( clk            : IN  STD_LOGIC;
         x_in           : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
         d_in           : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
         e_out, y_out   : OUT STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
         f0_out, f1_out : OUT STD_LOGIC_VECTOR(W1-1 DOWNTO 0));
END fir_lms;

ARCHITECTURE flex OF fir_lms IS

  SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
  SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
  TYPE ARRAY_N1BIT IS ARRAY (0 TO L-1) OF N1BIT;
  TYPE ARRAY_N2BIT IS ARRAY (0 TO L-1) OF N2BIT;

  SIGNAL d       : N1BIT;
  SIGNAL emu     : N1BIT;
  SIGNAL y, sxty : N2BIT;
  SIGNAL e, sxtd : N2BIT;
  SIGNAL x, f    : ARRAY_N1BIT;   -- Coeff/Data arrays
  SIGNAL p, xemu : ARRAY_N2BIT;   -- Product arrays

BEGIN

  dsxt: PROCESS (d)   -- 16 bit signed extension for input d
  BEGIN
    sxtd(7 DOWNTO 0) <= d;
    FOR k IN 15 DOWNTO 8 LOOP
      sxtd(k) <= d(d'high);
    END LOOP;
  END PROCESS;

  Store: PROCESS      ------> Store these data or coefficients
  BEGIN
    WAIT UNTIL clk = '1';
    d <= d_in;
    x(0) <= x_in;
    x(1) <= x(0);
    f(0) <= f(0) + xemu(0)(15 DOWNTO 8);  -- implicit
    f(1) <= f(1) + xemu(1)(15 DOWNTO 8);  -- divide by 2
  END PROCESS Store;

  MulGen1: FOR I IN 0 TO L-1 GENERATE
  FIR: lpm_mult               -- Multiply p(i) = f(i) * x(i);
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,
                  LPM_REPRESENTATION => "SIGNED",
                  LPM_WIDTHP => W2,
                  LPM_WIDTHS => W2)
    PORT MAP ( dataa => x(I), datab => f(I),
               result => p(I));
  END GENERATE;

  y <= p(0) + p(1);   -- Compute ADF output

  ysxt: PROCESS (y)   -- Scale y by 128 because x is fraction
  BEGIN
    sxty(8 DOWNTO 0) <= y(15 DOWNTO 7);
    FOR k IN 15 DOWNTO 9 LOOP
      sxty(k) <= y(y'high);
    END LOOP;
  END PROCESS;

  e <= sxtd - sxty;
  emu <= e(8 DOWNTO 1);   -- e*mu divide by 2 and
                          -- 2 from xemu makes mu=1/4
  MulGen2: FOR I IN 0 TO L-1 GENERATE
  FUPDATE: lpm_mult       -- Multiply xemu(i) = emu * x(i);
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,
                  LPM_REPRESENTATION => "SIGNED",
                  LPM_WIDTHP => W2,
                  LPM_WIDTHS => W2)
    PORT MAP ( dataa => x(I), datab => emu,
               result => xemu(I));
  END GENERATE;

  y_out  <= y;        -- Monitor some test signals
  e_out  <= e;
  f0_out <= f(0);
  f1_out <= f(1);

END flex;

³ The equivalent Verilog code fir_lms.v for this example can be found in
Appendix A on page 481.

Fig. 8.20. Signal flow graph of the LMS algorithm.

Fig. 8.21. VHDL simulation of the power-line interference cancellation using the
LMS algorithm.

The design is a literal interpretation of the adaptive LMS filter architecture
found in Fig. 8.20 (p. 393). The output of each tap of the tapped delay line is
multiplied by the appropriate filter coefficient and the results are added. The
response of the adaptive filter y and of the overall system e to a reference
signal x and a desired signal d is shown in Fig. 8.21. The filter adapts after
approximately 20 steps at 1 μs to the optimal values $f_0 = 43.3$ and $f_1 = 25$.
Note that MaxPlusII displays negative numbers as unsigned numbers, e.g.,
−10 is displayed as 256 − 10 = 246. The design consumes 612 logic cells and
runs with a Registered Performance of 9.0 MHz. | 8.5 |
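The scaling chain in the listing (the shift of y by 7 bits, the division of e by 2, and the use of the upper byte of xemu) can be checked with plain integer arithmetic. The following Python sketch models one clock of a single tap; it is an illustration only, the function names are chosen here, and overflow of the 8-bit and 16-bit ranges is assumed not to occur.

```python
def lms_tap_update(f, x, d):
    """One update of a single coefficient with the listing's scaling.

    x is a fraction scaled by 128; f and d are plain integers.
    """
    y = f * x              # 16-bit product p = f * x
    sxty = y >> 7          # y(15 DOWNTO 7): remove the 128 scaling of x
    e = d - sxty           # error
    emu = e >> 1           # e(8 DOWNTO 1): divide by 2
    xemu = emu * x         # update product, still scaled by 128
    return f + (xemu >> 8), e   # xemu(15 DOWNTO 8): 128/256 = divide by 2

def as_unsigned(value, bits=8):
    """Two's complement bit pattern read as an unsigned number."""
    return value & ((1 << bits) - 1)

print(lms_tap_update(0, 127, 100))   # (24, 100)
print(as_unsigned(-10))              # 246
```

For x = 127 (about 0.99), e = 100, and f = 0, the full-precision update $\mu\, e\, x$ with $\mu = 1/4$ would be about 24.8; the integer chain yields 24, which shows that the two divisions by 2 together implement $\mu = 1/4$.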



The previous example also shows that the standard LMS implementation 
has a low Registered Performance due to the fact that two multipliers and 
several add operations have to be performed in one clock cycle before the 
filter coefficient can be updated. In the following section we wish therefore 
to study how to achieve a higher throughput. 



8.5.3 Pipelined LMS Filters 

As can be seen from Fig. 8.20 (p. 393) the original LMS adaptive filter has
a long update path and hence the performance, even for 8-bit data and
coefficients, is relatively slow. It is therefore no surprise that many attempts
have been made to improve the throughput of the LMS adaptive filter. The
optimal number of pipeline stages from Fig. 8.20 (p. 393) can be computed
as follows: For the $(b \times b)$ multiplier of $f_k$ a total of $\log_2(b)$ stages are needed,
see also (2.30), p. 61. For the adder tree an additional $\log_2(L)$ pipeline stages
would be sufficient, and one additional stage is needed for the computation of the error.
The coefficient update multiplication requires an additional $\log_2(b)$ pipeline
stages. The total number of pipeline stages for maximum throughput is
therefore

$$D_{\mathrm{opt}} = 2\log_2(b) + \log_2(L) + 1, \qquad (8.46)$$

where we have assumed that $\mu$ is a power-of-two constant and the scaling with
$\mu$ can be done without the need for additional pipeline stages. If, however, the
normalized LMS is used, then $\mu$ will no longer be a constant and, depending
on the bit width of $\mu$, additional pipeline stages will be required.
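As a quick numerical check of (8.46), the stage count can be computed for a given bit width and filter length. The helper below is a sketch; the function name is chosen here, and rounding the logarithms up for values that are not powers of two is an assumption of this sketch.

```python
def optimal_pipeline_stages(b, L):
    """D_opt = 2*log2(b) + log2(L) + 1 from (8.46).

    (n - 1).bit_length() equals ceil(log2(n)) for integers n >= 1, which
    avoids floating-point logarithms.
    """
    log2_b = (b - 1).bit_length()
    log2_L = (L - 1).bit_length()
    return 2 * log2_b + log2_L + 1

print(optimal_pipeline_stages(8, 2))    # 8
print(optimal_pipeline_stages(16, 16))  # 13
```

For 8-bit data and a two-tap filter this gives 8 stages.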

Pipelining an LMS filter is not as simple as for an FIR filter, because the
LMS has, like the IIR filter, feedback. We therefore need to ensure that the
coefficients of the pipelined filter still converge to the same values as those of the
adaptive filter without pipelining. Most of the ideas to pipeline IIR filters can
be used to pipeline an LMS adaptive filter. The suggestions include

• Delayed LMS [224, 234, 235] 

• Look-ahead transformation of the pipelined LMS [219, 236, 237] 

• Transposed form LMS filter [238] 

• Block transformation using FFTs [229] 

We have already discussed the block transform algorithms and now wish 
in the following to briefly review the other techniques to improve the LMS 
throughput. 

The Delayed LMS Algorithm. In the delayed LMS algorithm (DLMS)
the assumption is that the gradient of the error $\nabla[n] = e[n]x[n]$ does not
change much if we delay the coefficient update by a couple of samples, i.e.,
$\nabla[n] \approx \nabla[n-D]$. It has been shown [234, 235] that as long as the delay is
less than the system order, i.e., filter length, this assumption holds well and
the update does not degrade the convergence speed. Long’s original DLMS
algorithm only considered pipelining the adder tree of the adaptive filter,
assuming also that multiplication and coefficient update can be done in one
clock cycle (as for programmable digital signal processors [224]), but for an
FPGA implementation the multiplier and the coefficient update require
additional pipeline stages. If we introduce a delay of $D_1$ in the filter computation
path and $D_2$ in the coefficient update path, the LMS Algorithm 8.2 (p. 376)
becomes:

$$e[n-D_1] = d[n-D_1] - f^T[n-D_1]\,x[n-D_1]$$
$$f[n+1] = f[n-D_1-D_2] + \mu\, e[n-D_1-D_2]\,x[n-D_1-D_2].$$
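The effect of a delayed gradient can be explored with a small simulation. The following Python sketch is an illustration under stated assumptions: it uses the common single-delay form $f[n+1] = f[n] + \mu\, e[n-D]\,x[n-D]$, a noise-free system identification setup, and names and parameter values chosen here.

```python
import random

def dlms_identify(g, x, mu, delay):
    """Identify the FIR system g, applying each gradient `delay` samples late."""
    L = len(g)
    f = [0.0] * L
    taps = [0.0] * L
    pending = []   # FIFO of (error, data vector) pairs waiting to be applied
    for sample in x:
        taps = [sample] + taps[:-1]
        d = sum(gk * xk for gk, xk in zip(g, taps))
        e = d - sum(fk * xk for fk, xk in zip(f, taps))
        pending.append((e, taps))
        if len(pending) > delay:
            e_old, x_old = pending.pop(0)       # e[n-D] and x[n-D]
            f = [fk + mu * e_old * xk for fk, xk in zip(f, x_old)]
    return f

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
print(dlms_identify([0.5, -0.3], x, mu=0.05, delay=0))   # conventional LMS
print(dlms_identify([0.5, -0.3], x, mu=0.05, delay=3))   # delayed update
```

With a small step size both runs approach the same coefficients; increasing `mu` or `delay` is a simple way to observe when the delayed update starts to overshoot.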

The Look-ahead DLMS Algorithm. For long adaptive filters with $D =
D_1 + D_2 < L$ the delayed coefficient update presented in the previous
section, in general, does not change the convergence of the ADF much. For
shorter filters it can, however, become necessary to reduce or even completely
remove the change in system function. From the IIR pipelining methods we
have discussed in Chap. 4, the time-domain interleaving method can always
be applied. We just perform a look-ahead in the coefficient computation, without
altering the overall system. Let us start with the DLMS update equations
with pipelining only in the coefficient computation, i.e.,

$$e^{\mathrm{DLMS}}[n-D] = d[n-D] - x^T[n-D]\,f[n-D]$$
$$f[n+1] = f[n] + \mu\, e[n-D]\,x[n-D].$$

But the error function of the LMS would be

$$e^{\mathrm{LMS}}[n-D] = d[n-D] - x^T[n-D]\,f[n].$$

We follow the idea from Poltmann [236] and wish to compute the correction
term $\Lambda[n]$, which cancels the change of the DLMS error computation
compared with the LMS, i.e.,

$$\Lambda[n] = e^{\mathrm{DLMS}}[n-D] - e^{\mathrm{LMS}}[n-D].$$

The error function of the DLMS is now changed to

$$e^{\mathrm{DLMS}}[n-D] = d[n-D] - x^T[n-D]\,f[n-D] - \Lambda[n].$$

We therefore need to determine the term

$$\Lambda[n] = x^T[n-D]\,(f[n] - f[n-D]).$$

The term in brackets can be recursively determined via

$$f[n] - f[n-D] = f[n-1] + \mu\, e[n-D-1]\,x[n-D-1] - f[n-D]$$
$$= f[n-2] + \mu\, e[n-D-2]\,x[n-D-2] + \mu\, e[n-D-1]\,x[n-D-1] - f[n-D]$$
$$= \sum_{s=1}^{D} \mu\, e[n-D-s]\,x[n-D-s],$$

and it follows for the correction term $\Lambda[n]$ finally

$$\Lambda[n] = x^T[n-D] \left( \sum_{s=1}^{D} \mu\, e[n-D-s]\,x[n-D-s] \right)$$

$$e^{\mathrm{DLMS}}[n-D] = d[n-D] - x^T[n-D]\,f[n-D] - x^T[n-D] \left( \sum_{s=1}^{D} \mu\, e[n-D-s]\,x[n-D-s] \right).$$

It can be seen that this correction term needs an additional $2D$ multiplications,
which may be too expensive in some applications. It has been suggested
[237] to “relax” the requirement for the correction term, but some additional
multipliers are still necessary.
We can, however, remove the influence of the coefficient update delay by
applying the look-ahead principle [219], i.e.,

$$f[n+1] = f[n-D_1] + \mu \sum_{k=0}^{D_2-1} e[n-D_1-k]\,x[n-D_1-k]. \qquad (8.47)$$

The summation in (8.47) builds the moving average over the last $D_2$ gradient
values, and makes it intuitively clear that the convergence will proceed more
smoothly. The advantage compared with the transformation from Poltmann
is that this look-ahead computation can be done without a general
multiplication. The moving average in (8.47) may even be implemented with a
first-order CIC filter (see Fig. 5.15, p. 190), which reduces the arithmetic
effort to one adder and one subtractor.

Approaches similar to the idea from Poltmann for improving the DLMS
algorithm have also been suggested [239, 240, 241].
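The remark that the moving sum in (8.47) needs only one adder and one subtractor can be illustrated with a running-sum recursion, which is the behavior of a first-order CIC (integrator plus comb). The Python sketch below is an illustration; the function name and the use of a `deque` as the comb delay line are choices made here.

```python
from collections import deque

def moving_sum(values, length):
    """Sum of the last `length` values using one add and one subtract per sample."""
    delay_line = deque([0] * length, maxlen=length)   # comb delay of `length` samples
    acc = 0
    out = []
    for v in values:
        acc += v - delay_line[0]   # integrator adds the new value, comb removes the oldest
        delay_line.append(v)       # maxlen drops the oldest entry automatically
        out.append(acc)
    return out

print(moving_sum([1, 2, 3, 4, 5], 3))   # [1, 3, 6, 9, 12]
```

Applied to the gradient values $e[n]x[n]$ with `length` equal to $D_2$, this yields the sum in (8.47) without recomputing all $D_2$ terms for every sample.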

8.5.4 Transposed Form LMS Filter 

We have seen that the DLMS algorithm can be smoothed by introducing a
look-ahead computation in the coefficient update, as we have used in IIR
filters, but this is, in general, not without additional cost. If we use, however, the
transposed FIR structure (see Fig. 3.3, p. 111) instead of the direct structure,
we can eliminate the delay of the adder tree completely. This will reduce the
requirement for the optimal number of pipeline stages from (8.46), p. 396,
by $\log_2(L)$ stages. For an LTI system both direct and transposed filters are
described by the same convolution equation, but for time-varying coefficients
we need to change the filter coefficient from

$$f_k[n] \quad \text{to} \quad f_k[n-k]. \qquad (8.48)$$

The equation for the estimated gradient (8.14) on page 376 now becomes

$$\nabla[n] = -2e[n]\frac{\partial y[n]}{\partial f_k[n]} \qquad (8.49)$$

$$= -2e[n]\,x[n-k]\frac{\partial f_k[n-k]}{\partial f_k[n]}. \qquad (8.50)$$

If we now assume that the coefficient update is relatively slow, i.e.,
$f_k[n-k] \approx f_k[n]$, the gradient becomes

$$\nabla[n] \approx -2e[n]x[n], \qquad (8.51)$$

and the coefficient update equation becomes:

$$f_k[n-k+1] = f_k[n-k] + \mu\, e[n]x[n]. \qquad (8.52)$$

The learning characteristics of the transposed-form adaptive filter
algorithms have been investigated by Jones [238], who showed that we will get
a somewhat slower convergence rate when compared with the original LMS
algorithm. The stability bound regarding $\mu$ also needs to be determined and
is found to be smaller than for the LMS algorithm.
[Figure 8.22: panel title $\mu = 1/4$.]

Fig. 8.22. 8-bit MatLab simulation of the power-line interference cancellation
using the DLMS algorithm with a delay of 6.



8.5.5 Design of DLMS Algorithms 

If we wish to pipeline the LMS filter from Example 8.5 (p. 392) we conclude
from the discussion above (8.46) that the optimal number of pipeline stages
becomes:

$$D_{\mathrm{opt}} = 2\log_2(b) + \log_2(L) + 1 = 2 \times 3 + 1 + 1 = 8. \qquad (8.53)$$

On the other hand, pipelining the multiplier can be done without additional
costs and we may therefore consider using only 6 pipeline stages. Figure 8.22
shows a MatLab simulation in 8-bit precision with a delay of 6. Compared with
the original LMS design from Example 8.5 (p. 392) it shows some “overswing”
in the adaptation process.

Example 8.6: Two-tap Pipelined Adaptive LMS FIR Filter 

The VHDL design⁴ for a filter with two coefficients $f_0$ and $f_1$ with a step
size of $\mu = 1/4$ is shown in the following listing.

-- This is a generic DLMS FIR filter generator
-- It uses W1 bit data/coefficients bits
LIBRARY lpm;                      -- Using predefined packages
USE lpm.lpm_components.ALL;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;

ENTITY fir6dlms IS                           ------> Interface
  GENERIC (W1 : INTEGER := 8;      -- Input bit width
           W2 : INTEGER := 16;     -- Multiplier bit width 2*W1
           L  : INTEGER := 2;      -- Filter length
           Delay : INTEGER := 3    -- Pipeline Delay
           );
  PORT ( clk            : IN  STD_LOGIC;
         x_in           : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
         d_in           : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
         e_out, y_out   : OUT STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
         f0_out, f1_out : OUT STD_LOGIC_VECTOR(W1-1 DOWNTO 0));
END fir6dlms;

ARCHITECTURE flex OF fir6dlms IS

  SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
  SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
  TYPE ARRAY_N1BITF IS ARRAY (0 TO L-1) OF N1BIT;
  TYPE ARRAY_N1BITX IS ARRAY (0 TO Delay+L-1) OF N1BIT;
  TYPE ARRAY_N1BITD IS ARRAY (0 TO Delay) OF N1BIT;
  TYPE ARRAY_N1BIT  IS ARRAY (0 TO L-1) OF N1BIT;
  TYPE ARRAY_N2BIT  IS ARRAY (0 TO L-1) OF N2BIT;

  SIGNAL xemu0, xemu1 : N1BIT;
  SIGNAL emu          : N1BIT;
  SIGNAL y, sxty      : N2BIT;
  SIGNAL e, sxtd      : N2BIT;
  SIGNAL f            : ARRAY_N1BITF;  -- Coefficient array
  SIGNAL x            : ARRAY_N1BITX;  -- Data array
  SIGNAL d            : ARRAY_N1BITD;  -- Reference array
  SIGNAL p, xemu      : ARRAY_N2BIT;   -- Product array

BEGIN

  dsxt: PROCESS (d)   -- make d a 16 bit number
  BEGIN
    sxtd(7 DOWNTO 0) <= d(Delay);
    FOR k IN 15 DOWNTO 8 LOOP
      sxtd(k) <= d(3)(7);
    END LOOP;
  END PROCESS;

  Store: PROCESS      ------> Store these data or coefficients
  BEGIN
    WAIT UNTIL clk = '1';
    d(0) <= d_in;     -- Shift register for desired data
    d(1) <= d(0);
    d(2) <= d(1);
    d(3) <= d(2);
    x(0) <= x_in;     -- Shift register for data
    x(1) <= x(0);
    x(2) <= x(1);
    x(3) <= x(2);
    x(4) <= x(3);
    f(0) <= f(0) + xemu(0)(15 DOWNTO 8);  -- implicit
    f(1) <= f(1) + xemu(1)(15 DOWNTO 8);  -- divide by 2
  END PROCESS Store;

  MulGen1: FOR I IN 0 TO L-1 GENERATE
  FIR: lpm_mult               -- Multiply p(i) = f(i) * x(i);
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,
                  LPM_REPRESENTATION => "SIGNED",
                  LPM_PIPELINE => Delay,
                  LPM_WIDTHP => W2,
                  LPM_WIDTHS => W2)
    PORT MAP ( dataa => x(I), datab => f(I),
               result => p(I), clock => clk);
  END GENERATE;

  y <= p(0) + p(1);   -- Compute ADF output

  ysxt: PROCESS (y)   -- scale y by 128 because x is fraction
  BEGIN
    sxty(8 DOWNTO 0) <= y(15 DOWNTO 7);
    FOR k IN 15 DOWNTO 9 LOOP
      sxty(k) <= y(y'high);
    END LOOP;
  END PROCESS;

  e <= sxtd - sxty;       -- e*mu divide by 2 and 2
  emu <= e(8 DOWNTO 1);   -- from xemu makes mu=1/4

  MulGen2: FOR I IN 0 TO L-1 GENERATE
  FUPDATE: lpm_mult       -- Multiply xemu(i) = emu * x(i);
    GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,
                  LPM_REPRESENTATION => "SIGNED",
                  LPM_PIPELINE => Delay,
                  LPM_WIDTHP => W2,
                  LPM_WIDTHS => W2)
    PORT MAP ( dataa => x(I+Delay), datab => emu,
               result => xemu(I), clock => clk);
  END GENERATE;

  y_out  <= y;        -- Monitor some test signals
  e_out  <= e;
  f0_out <= f(0);
  f1_out <= f(1);

END flex;

⁴ The equivalent Verilog code fir_lms.v for this example can be found in
Appendix A on page 483.

The design is a literal interpretation of the adaptive LMS filter architecture
found in Fig. 8.20 (p. 393) with the additional delay of 3 pipeline stages for
each multiplier. The output of each tap of the tapped delay line is multiplied
by the appropriate filter coefficient and the results are added. Note the
additional delays for x and d in the Store: PROCESS to make the
signals coherent. The response of the adaptive filter y and of the overall
system e to a reference signal x and a desired signal d is shown in the VHDL
simulation in Fig. 8.23. The filter adapts after approximately 30 steps at
1.5 μs to the optimal values $f_0 = 43.3$ and $f_1 = 25$. But it also shows some
“overswing” in the adaptation process. The design consumes 658 logic cells
and runs with a Registered Performance of 23.41 MHz. | 8.6 |

Table 8.2. Size and performance data of different pipeline options of the DLMS
algorithms.

D    LEs    MHz      Comment
0    612     9.0     original LMS
1    628    14.90    original DLMS
3    612    17.88    pipeline of f update only
6    658    23.41    pipeline all multiplier
8    682    46.72    optimal number of stages

Fig. 8.23. VHDL simulation of the power-line interference cancellation using the
DLMS algorithm with a delay of 6.



Compared with the previous example we may also consider other pipelin- 
ing options. We may, for instance, use pipelining only in the coefficient up- 
date, or we may implement the optimal number of pipeline stages, i.e., 8. 
Table 8.2 gives an overview of the different options. 

From Table 8.2 it can be seen that compared to the original LMS algorithm
we may gain a speed improvement of up to a factor of about 5 (46.72 MHz
versus 9.0 MHz), while at the same time the additional hardware costs are only
about 10% (682 versus 612 LEs). The additional effort
comes from the extra delays of the reference data $d[n]$ and the filter input
$x[n]$. The only limitation is that for large pipeline
delays it may become necessary to adjust $\mu$ in order to guarantee stability.



8.5.6 LMS Designs using SIGNUM Function 

We saw in the previous section that the implementation cost of the LMS
algorithm is already high for short filter lengths. The highest cost of the filter
comes from the large number of general multipliers, and the major goal in
reducing the effort is to reduce the number of multipliers. Obviously the FIR
filter part cannot be reduced, but different simplifications in the computation
of the coefficient update have been investigated. Given the fact that to ensure
stability the step size is usually chosen much smaller than $\mu_{\max}$, the following
suggestions have been made:

• Use only the sign of the reference data $x[n]$, not the full precision value, to
update the filter coefficients.

• Use only the sign of the error $e[n]$, not the full precision value, to update
the filter coefficients.

• Use both of the previous simplifications via the sign of error and data.

The three modifications can be described with the following coefficient
update equations in the LMS algorithm:

$f[n+1] = f[n] + \mu \times e[n] \times \mathrm{sign}(x[n])$    sign data function

$f[n+1] = f[n] + \mu \times x[n] \times \mathrm{sign}(e[n])$    sign error function

$f[n+1] = f[n] + \mu \times \mathrm{sign}(e[n]) \times \mathrm{sign}(x[n])$    sign-sign function.

Fig. 8.24. Simulation of the power-line interference cancellation using the 3
simplified signed LMS (SLMS) algorithms. (left) System output $e[n]$. (right) Filter
coefficients.
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We note from the simulation of the three possible simplifications shown 
in Fig. 8.24 that for the sign-data function almost the same result occurs 
as in the full precision case. This is no surprise, because our input refer- 
ence signal x[n] = cos[7rn/2 + <ft\ will not be much quantized through the 
sign operation anyway. This is much different for the sign-error function. 
Here the quantization through the sign operation essentially alters the time 
constant of the system. But finally, after about 2.5 s the correct values are 
reached, although from the system output e[n\ we note the essential ripple in 
the output function even after a long simulation time. Finally, the sign-sign 
algorithm converges faster than the sign-error algorithm, but here also the 
system output shows essential ripple for e[n\. From the simulation it can be 
seen that the sign-function simplification (to save the L multiplications in the 
filter coefficient update) has to be evaluated carefully for the specific appli- 
cation to still guarantee a stable system and acceptable time constants of the 
system. In fact, it has been shown that for specific signals and application 
the sign algorithms does not converge, although the full precision algorithm 
would converge. Besides the sign effect we also need to ensure that the integer 
quantization through the implementation does not alter the desired system 
properties. 
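The three sign-based coefficient updates, together with the full-precision update, can be sketched in a few lines. The following is a minimal illustration in Python (chosen here only for brevity, not the language used elsewhere in this chapter); the function names `sign` and `lms_step` and the `variant` strings are assumptions made for this sketch, and sign(0) is taken as 0.

```python
def sign(v):
    """Signum function with sign(0) = 0 (an assumption of this sketch)."""
    return (v > 0) - (v < 0)

def lms_step(f, x, d, mu, variant="lms"):
    """One coefficient update. f, x: equal-length lists; d: desired sample.

    Returns the updated coefficient list and the a priori error e[n].
    """
    e = d - sum(fk * xk for fk, xk in zip(f, x))      # e[n] = d[n] - f^T x
    if variant == "lms":                              # full precision
        g = [e * xk for xk in x]
    elif variant == "sign-data":                      # mu * e[n] * sign(x[n])
        g = [e * sign(xk) for xk in x]
    elif variant == "sign-error":                     # mu * x[n] * sign(e[n])
        g = [sign(e) * xk for xk in x]
    elif variant == "sign-sign":                      # mu * sign(e[n]) * sign(x[n])
        g = [sign(e) * sign(xk) for xk in x]
    else:
        raise ValueError(variant)
    return [fk + mu * gk for fk, gk in zip(f, g)], e
```

In the sign-data and sign-sign variants the product inside the update reduces to a conditional negation, which is where the L multipliers of the coefficient update are saved.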

Another point to consider when using the sign function is the error floor 
that can be reached. This is discussed in the following example. 

Example 8.7: Error Floor in Signum LMS Filters 

Suppose we have a system identification configuration as discussed in Sect. 8.3.1 (p. 381), and we wish to use one of the signum-type ADF algorithms. What error floor can then be reached? Obviously the signum operation loses some precision, and we expect that we will not reach the same low noise level as with a full-precision LMS algorithm. We also expect that the learning rate will be somewhat lower than with the full-precision LMS algorithm. This can be verified by the simulation results shown in Fig. 8.25 for an average over 50 learning curves and two different eigenvalue ratios (EVRs). The sign-data algorithm shows some delay in the adaptation when compared with the full-precision LMS algorithm, but reaches the error floor, which was set to −60 dB. The sign-error and sign-sign algorithms show larger delays in the adaptation and reach an error floor of only about −40 dB. This larger error may or may not be acceptable for some applications. | 8.7 |



The sign-sign algorithm is attractive from a software or hardware implementation standpoint and has been used in the International Telecommunication Union (ITU) standard for adaptive differential pulse code modulation (ADPCM) transmission. From a hardware implementation standpoint we do not actually need to implement the sign-sign algorithm, because the multiplication with μ is just a scaling with a constant, and one of the single-sign algorithms already allows us to save the L multipliers we usually need for the filter coefficient update in Fig. 8.20 (p. 393).

Fig. 8.25. Simulation of the system identification experiment using the 3 simplified signed LMS algorithms for an average of 50 learning curves and an error floor of −60 dB. (a) LMS with full precision. (b) Signed data. (c) Signed error. (d) Sign-sign LMS algorithm.



8.6 Recursive Least Square Algorithms 

In the LMS algorithm discussed in the previous sections the filter coefficients are gradually adjusted by a stochastic gradient method to finally approximate the Wiener-Hopf optimal solution. The recursive least square (RLS) algorithm takes another approach. Here, the estimates of the (L × L) autocorrelation matrix $\mathbf{R}_{xx}$ and the cross-correlation vector $\mathbf{r}_{dx}$ are iteratively updated with each new incoming data pair (x[n], d[n]). The simplest approach would be to reconstruct the Wiener-Hopf equation (8.9), i.e., $\mathbf{R}_{xx}\mathbf{f}_{\mathrm{opt}} = \mathbf{r}_{dx}$, and solve it again. However, this would require one matrix inversion as each new data pair arrives and has the potential of being computationally expensive. The main goal of the different RLS algorithms discussed in the following is therefore to find an (iterative) time recursion for the filter coefficients $\mathbf{f}[n+1]$ in terms of the previous least-square estimate $\mathbf{f}[n]$ and the new data pair (x[n], d[n]). Each new incoming value x[n] is placed in the length-L data array $\mathbf{x}[n] = [x[n]\; x[n-1]\; \ldots\; x[n-(L-1)]]^T$.

We then wish to add x[n]x[n] to $R_{xx}[0,0]$, x[n]x[n−1] to $R_{xx}[0,1]$, etc. Mathematically we just compute the product $\mathbf{x}\mathbf{x}^T$ and add this (L × L) matrix to the previous estimate of the autocorrelation matrix $\mathbf{R}_{xx}[n]$. The recursive computation may be written as follows:

$$\mathbf{R}_{xx}[n+1] = \mathbf{R}_{xx}[n] + \mathbf{x}[n]\mathbf{x}^T[n] = \sum_{s=0}^{n} \mathbf{x}[s]\mathbf{x}^T[s]. \tag{8.54}$$

For the cross-correlation vector $\mathbf{r}_{dx}[n+1]$ we also build an "improved" estimate by adding, with each new pair (x[n], d[n]), the vector $d[n]\mathbf{x}[n]$ to the previous estimate $\mathbf{r}_{dx}[n]$. The recursion for the cross-correlation becomes

$$\mathbf{r}_{dx}[n+1] = \mathbf{r}_{dx}[n] + d[n]\,\mathbf{x}[n]. \tag{8.55}$$

We can now use the Wiener-Hopf equation in a time-recursive fashion and compute

$$\mathbf{R}_{xx}[n+1]\,\mathbf{f}_{\mathrm{opt}}[n+1] = \mathbf{r}_{dx}[n+1]. \tag{8.56}$$

For true estimates of the cross- and autocorrelation we would need to scale by the number of summations, which is proportional to n. However, the cross- and autocorrelation are scaled by the same factor, which cancels in the iterative algorithm, and we get for the filter coefficient update

$$\mathbf{f}_{\mathrm{opt}}[n+1] = \mathbf{R}_{xx}^{-1}[n+1]\,\mathbf{r}_{dx}[n+1]. \tag{8.57}$$

Although this first version of the RLS algorithm is computationally intensive (approximately L³ operations are needed for the matrix inversion), it still shows the principal idea of the RLS algorithm and can be quickly programmed, for instance in MatLab. The following code segment shows the inner loop of the length-L RLS filter algorithm:

    x = [xin; x(1:L-1)];     % get new sample
    y = f' * x;              % filter output
    err = din - y;           % error: reference - filter output
    Rxx = Rxx + x*x';        % update the autocorrelation matrix
    rdx = rdx + din .* x;    % update the cross-correlation vector
    f = Rxx^(-1) * rdx;      % compute filter coefficients



where Rxx is an (L × L) matrix and rdx is an (L × 1) vector. The cross-correlation vector is usually initialized with $\mathbf{r}_{dx}[0] = \mathbf{0}$. The only problem with the algorithm so far arises during the first n < L iterations, when $\mathbf{R}_{xx}[n]$ has only a few nonzero entries and is consequently singular, so that no inverse exists. There are a couple of ways to tackle this problem:



• We can wait with the computation of the inverse until we find that the autocorrelation matrix is nonsingular, i.e., $\det(\mathbf{R}_{xx}[n]) > 0$.

• We can use $\mathbf{R}_{xx}^{+}[n] = \left(\mathbf{R}_{xx}^T[n]\,\mathbf{R}_{xx}[n]\right)^{-1}\mathbf{R}_{xx}^T[n]$, the so-called pseudoinverse, which is a standard result in linear algebra regarding the solution of an overdetermined set of linear equations.

• We can initialize the autocorrelation matrix $\mathbf{R}_{xx}$ with δI, where δ is chosen to be a small (large) constant for a high (low) S/N ratio of the input signal.

Fig. 8.26. Learning curves of the RLS algorithms using different initializations of $\mathbf{R}_{xx}[0] = \delta\mathbf{I}$ or $\mathbf{R}_{xx}^{-1}[0] = \delta^{-1}\mathbf{I}$. High S/N is −48 dB and low is −10 dB. δ = 1000, 1, or 1/1000.

The third approach is the most popular due to its computational benefit and the possibility of setting an initial "learning rate" using the constant δ. The influence of the initialization in the RLS algorithm for an experiment similar to Sect. 8.3.1 (p. 381) with an average over 5 learning curves is shown in Fig. 8.26. The upper row shows the full-length simulation over 4000 iterations, while the lower row shows the first 100 iterations only. For high S/N (−48 dB) we may use a large value for the initialization, which yields fast convergence. For low S/N values (−10 dB) small initialization values should be used; otherwise large errors can occur in the first iterations, which may or may not be tolerable for the specific application.
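To make the effect of the δI initialization concrete, the brute-force recursion can be sketched as follows. This is a minimal Python illustration, not the simulation used for Fig. 8.26; the helper names `solve` and `rls_direct`, the default δ, and the choice to solve the linear system by Gaussian elimination instead of forming the inverse are all assumptions of this sketch.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]       # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]                        # pivot row to the top
        for r in range(c + 1, n):
            q = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= q * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def rls_direct(xs, ds, L, delta=1e-3):
    """Brute-force RLS: accumulate Rxx and rdx, then re-solve Rxx f = rdx."""
    Rxx = [[delta if i == j else 0.0 for j in range(L)] for i in range(L)]
    rdx = [0.0] * L
    x = [0.0] * L
    f = [0.0] * L
    for xin, din in zip(xs, ds):
        x = [xin] + x[:-1]                             # shift new sample into taps
        for i in range(L):
            rdx[i] += din * x[i]                       # cross-correlation update
            for j in range(L):
                Rxx[i][j] += x[i] * x[j]               # autocorrelation update
        f = solve(Rxx, rdx)                            # coefficients for this step
    return f
```

Because Rxx starts at δI with δ > 0, the matrix is nonsingular from the first sample on, so the system can be solved even during the first n < L iterations.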

A computationally more attractive approach than the first "brute force" RLS algorithm will be discussed in the following. The key idea is that we do not compute the matrix inversion at all and instead use a time recursion directly for $\mathbf{R}_{xx}^{-1}[n]$; we will actually never have (or need) $\mathbf{R}_{xx}[n]$ available. To do so, we substitute the Wiener equation for time n + 1, i.e., $\mathbf{R}_{xx}[n+1]\,\mathbf{f}[n+1] = \mathbf{r}_{dx}[n+1]$, into (8.55), and it follows that

$$\mathbf{R}_{xx}[n+1]\,\mathbf{f}[n+1] = \mathbf{R}_{xx}[n]\,\mathbf{f}[n] + d[n+1]\,\mathbf{x}[n+1]. \tag{8.58}$$

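Now we use (8.54) to get

$$\mathbf{R}_{xx}[n+1]\,\mathbf{f}[n+1] = \left(\mathbf{R}_{xx}[n+1] - \mathbf{x}[n+1]\,\mathbf{x}^T[n+1]\right)\mathbf{f}[n] + d[n+1]\,\mathbf{x}[n+1]. \tag{8.59}$$

We can rearrange (8.59) by multiplying by $\mathbf{R}_{xx}^{-1}[n+1]$ to have $\mathbf{f}[n+1]$ on the left-hand side of the equation:

$$\mathbf{f}[n+1] = \mathbf{f}[n] + \underbrace{\mathbf{R}_{xx}^{-1}[n+1]\,\mathbf{x}[n+1]}_{\mathbf{k}[n+1]}\;\underbrace{\left(d[n+1] - \mathbf{f}^T[n]\,\mathbf{x}[n+1]\right)}_{e[n+1]} = \mathbf{f}[n] + \mathbf{k}[n+1]\,e[n+1],$$

where the a priori error is defined as

$$e[n+1] = d[n+1] - \mathbf{f}^T[n]\,\mathbf{x}[n+1],$$

and the Kalman gain vector is defined as

$$\mathbf{k}[n+1] = \mathbf{R}_{xx}^{-1}[n+1]\,\mathbf{x}[n+1]. \tag{8.60}$$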

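As mentioned above, the direct computation of the matrix inversion is computationally intensive, and it is much more efficient to use the iteration equation (8.54) again to avoid the inversion altogether. We use the so-called "matrix inversion lemma," which can be written as the following matrix identity

$$(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1},$$

which holds for all matrices A, B, C, and D of compatible dimensions and nonsingular A and C. We make the following associations:

$$\mathbf{A} = \mathbf{R}_{xx}[n] \qquad \mathbf{B} = \mathbf{x}[n] \qquad \mathbf{C} = 1 \qquad \mathbf{D} = \mathbf{x}^T[n].$$

The iterative equation for $\mathbf{R}_{xx}^{-1}$ becomes:

$$\mathbf{R}_{xx}^{-1}[n+1] = \left(\mathbf{R}_{xx}[n] + \mathbf{x}[n]\,\mathbf{x}^T[n]\right)^{-1} = \mathbf{R}_{xx}^{-1}[n] - \frac{\mathbf{R}_{xx}^{-1}[n]\,\mathbf{x}[n]\,\mathbf{x}^T[n]\,\mathbf{R}_{xx}^{-1}[n]}{1 + \mathbf{x}^T[n]\,\mathbf{R}_{xx}^{-1}[n]\,\mathbf{x}[n]}. \tag{8.61}$$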



If we use the Kalman gain factor $\mathbf{k}[n] = \mathbf{R}_{xx}^{-1}[n]\,\mathbf{x}[n]$ from (8.60) we can rewrite (8.61) more compactly as:

$$\mathbf{R}_{xx}^{-1}[n+1] = \left(\mathbf{R}_{xx}[n] + \mathbf{x}[n]\,\mathbf{x}^T[n]\right)^{-1} = \mathbf{R}_{xx}^{-1}[n] - \frac{\mathbf{k}[n]\,\mathbf{k}^T[n]}{1 + \mathbf{x}^T[n]\,\mathbf{k}[n]}.$$

Fig. 8.27. Basic configuration for interference cancellation using the RLS algorithm.

This recursion is, as mentioned before, initialized [213] with

$$\mathbf{R}_{xx}^{-1}[0] = \delta\mathbf{I} \quad\text{with}\quad \delta = \begin{cases} \text{large positive constant} & \text{for high SNR} \\ \text{small positive constant} & \text{for low SNR.} \end{cases}$$



With this recursive computation of the inverse autocorrelation matrix the computational effort is now proportional to L², a substantial saving for large values of L. Figure 8.27 shows a summary of the RLS adaptive filter algorithm.
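The rank-one update of the inverse can be checked numerically on a small case. The following Python sketch is only an illustration; the function names `mat_vec` and `rank_one_inverse_update` are assumptions, and the code relies on the inverse being symmetric, so that the row vector x^T R^{-1} equals the transpose of R^{-1} x.

```python
def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def rank_one_inverse_update(P, x):
    """Return (R + x x^T)^-1 from P = R^-1 without any matrix inversion.

    Assumes P is symmetric, as it is for an autocorrelation matrix.
    """
    k = mat_vec(P, x)                                    # k = R^-1 x
    denom = 1.0 + sum(xi * ki for xi, ki in zip(x, k))   # 1 + x^T R^-1 x
    n = len(x)
    return [[P[i][j] - k[i] * k[j] / denom for j in range(n)] for i in range(n)]
```

For R = diag(2, 4) and x = [1, 2]^T, the direct inverse of R + x x^T = [[3, 2], [2, 8]] is [[0.4, −0.1], [−0.1, 0.15]], and the update above returns the same matrix using only about L² multiplications.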



8.6.1 RLS with Finite Memory 

As we can see from (8.54) and (8.55), the adaptive algorithm derived so far has infinite memory. The values of the filter coefficients are functions of all past inputs starting with time zero. As will be discussed next, it is often useful to introduce a "forgetting factor" into the algorithm, so that recent data are given greater importance than older data. This not only reduces the influence of older data, it also ensures that the update of the cross- and autocorrelation with each new incoming data pair does not cause an overflow in the arithmetic. One way of achieving a finite memory is to replace the sum-of-squares cost function by an exponentially weighted sum of the output:

$$J = \sum_{s=0}^{n} \rho^{\,n-s}\, e^2[s], \tag{8.62}$$

where 0 ≤ ρ ≤ 1 is a constant determining the effective memory of the algorithm. The case ρ = 1 is the infinite-memory case, as before. When ρ < 1 the algorithm will have an effective memory of τ = −1/log(ρ) ≈ 1/(1 − ρ) data points. The exponentially weighted RLS algorithm can now be summarized as:

Algorithm 8.8: RLS Algorithm 

The exponentially weighted RLS algorithm to adjust the L coefficients of an adaptive filter uses the following steps:

1) Initialize $\mathbf{x} = \mathbf{f} = [0, 0, \ldots, 0]^T$ and $\mathbf{R}_{xx}^{-1}[0] = \delta\mathbf{I}$.

2) Accept a new pair of input samples {x[n+1], d[n+1]} and shift x[n+1] into the reference signal vector $\mathbf{x}[n+1]$.

3) Compute the output signal of the FIR filter via

$$y[n+1] = \mathbf{f}^T[n]\,\mathbf{x}[n+1]. \tag{8.63}$$

4) Compute the a priori error function with

$$e[n+1] = d[n+1] - y[n+1]. \tag{8.64}$$

5) Compute the Kalman gain factor with

$$\mathbf{k}[n+1] = \mathbf{R}_{xx}^{-1}[n+1]\,\mathbf{x}[n+1]. \tag{8.65}$$

6) Update the filter coefficients according to

$$\mathbf{f}[n+1] = \mathbf{f}[n] + \mathbf{k}[n+1]\,e[n+1]. \tag{8.66}$$

7) Update the inverse autocorrelation matrix according to

$$\mathbf{R}_{xx}^{-1}[n+1] = \frac{1}{\rho}\left(\mathbf{R}_{xx}^{-1}[n] - \frac{\mathbf{k}[n+1]\,\mathbf{k}^T[n+1]}{\rho + \mathbf{x}^T[n+1]\,\mathbf{k}[n+1]}\right). \tag{8.67}$$

Next continue with step 2.
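The steps above can be sketched in software as follows. This Python illustration is not a transcription of Algorithm 8.8: it uses the commonly cited computable ordering, in which the gain is formed from the previous inverse $\mathbf{R}_{xx}^{-1}[n]$ with the normalization ρ + x^T R^{-1}[n] x, and the inverse is then updated with that gain. This ordering, the function name `rls_identify`, and the default values of ρ and δ are assumptions of this sketch.

```python
def rls_identify(xs, ds, L, rho=0.99, delta=1000.0):
    """Exponentially weighted RLS; returns the coefficients after the last sample."""
    # P holds the inverse autocorrelation matrix, initialized to delta * I
    P = [[delta if i == j else 0.0 for j in range(L)] for i in range(L)]
    x = [0.0] * L
    f = [0.0] * L
    for xin, din in zip(xs, ds):
        x = [xin] + x[:-1]                                   # step 2: shift in sample
        e = din - sum(fi * xi for fi, xi in zip(f, x))       # steps 3-4: a priori error
        Px = [sum(P[i][j] * x[j] for j in range(L)) for i in range(L)]
        denom = rho + sum(xi * pi for xi, pi in zip(x, Px))  # rho + x^T P x
        k = [pi / denom for pi in Px]                        # step 5: gain vector
        f = [fi + ki * e for fi, ki in zip(f, k)]            # step 6: coefficient update
        # step 7: P <- (P - k x^T P) / rho, using the symmetry of P
        P = [[(P[i][j] - k[i] * Px[j]) / rho for j in range(L)] for i in range(L)]
    return f
```

With ρ = 1 the sketch reduces to the infinite-memory case; smaller ρ shortens the effective memory to roughly 1/(1 − ρ) samples.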

The computational cost of the RLS algorithm is (3L² + 9L)/2 multiplications and (3L² + 5L)/2 additions or subtractions per input sample, which is still substantially more than for the LMS algorithm. The advantage, as we will see in the following example, is a higher rate of convergence and no need to select the step size μ, which may at times be difficult when stability of the adaptive algorithm has to be guaranteed.

Example 8.9: RLS Learning Curves 

In this example we wish to evaluate a configuration called system identification to compare RLS and LMS convergence. We have already used this type of performance evaluation for the LMS ADF in Sect. 8.3.1 (p. 381). The system configuration is shown in Fig. 8.12 (p. 381). The adaptive filter has a length of L = 16, the same length as the "unknown" system, whose coefficients have to be learned. The additive noise level behind the "unknown system" has been set to −48 dB, equivalent to an 8-bit quantization. For the LMS algorithm the eigenvalue ratio (EVR) is the critical parameter that determines the convergence speed, see (8.25), p. 379. In order to generate different eigenvalue ratios we use a white Gaussian noise source with σ² = 1 that is filtered by an FIR-type filter shown in Table 8.1 (p. 383). The coefficients are normalized to $\sum_k h[k]^2 = 1$, so that the signal power does not change. The impulse response of the unknown system is an odd filter with coefficients 1, −2, 3, −4, …, −3, 2, −1 as shown in Fig. 8.28a. The step size for the LMS algorithm has been determined with

$$\mu_{\max} = \frac{2}{3 \times L \times E\{x^2\}} = \frac{1}{24}. \tag{8.68}$$

In order to guarantee perfect stability the step size for the LMS algorithm has been chosen to be μ = μ_max/2 = 1/48. For the transform-domain DCT-LMS algorithm a power normalization for each coefficient is used, see Fig. 8.17 (p. 389). From the simulation results shown in Fig. 8.29 it can be seen that the RLS converges faster than the LMS with increased EVR. DCT-LMS converges faster than LMS and in some cases nearly as fast as the RLS algorithm. The DCT-LMS algorithm performs less well when we look at the residual error level and the consistency of convergence. For higher EVR the RLS performance is better for both level and consistency of convergence. For EVR = 1 the DCT-LMS reaches a value in the 50 dB range, but for EVR = 100 only 40 dB are reached. The RLS converges below the system noise. | 8.9 |

Fig. 8.28. Simulation of the L = 16 tap adaptive filter system identification. (a) Impulse response of the "unknown system." (b) RLS coefficient learning curves for EVR = 100.



8.6.2 Fast RLS Kalman Implementation 

For the least-square FIR fast Kalman algorithm first presented by Ljung et al. [242], the concept of single-step linear forward and backward prediction plays a central role. Using these forward and backward coefficients in an all-recursive, one-dimensional Levinson-Durbin type algorithm, it will be possible to update the Kalman gain vector with only an O(L) type effort.

A one-step forward predictor is presented in Fig. 8.30. The predictor estimates the present value x[n] based on its L most recent past values. The a posteriori error in the prediction is quantified by

$$\varepsilon_L^f[n] = x[n] - \hat{x}[n] = x[n] - \mathbf{a}^T[n]\,\mathbf{x}_L[n-1]. \tag{8.69}$$

The superscript indicates that it is the forward prediction error, while the subscript describes the order (i.e., length) of the predictor. We will drop the index L, and the vector length should be taken as L for the remainder of this section, if not otherwise noted. It is also advantageous to compute the a priori error, which uses the filter coefficients of the previous iteration:

$$e_L^f[n] = x[n] - \mathbf{a}^T[n-1]\,\mathbf{x}_L[n-1]. \tag{8.70}$$

The least-square minimum of the prediction error can be computed via

$$\frac{\partial\left(\varepsilon^f[n]\right)^2}{\partial\,\mathbf{a}^T[n]} = -E\left\{\left(x[n] - \mathbf{a}^T[n]\,\mathbf{x}[n-1]\right)x[n-s]\right\} = 0 \quad\text{for } s = 1, 2, \ldots, L. \tag{8.71}$$

Fig. 8.29. Simulation results for an L = 16-tap adaptive filter system identification. Learning curve J for LMS, transform-domain DCT-LMS, and RLS with $\mathbf{R}_{xx}^{-1}[0] = \mathbf{I}$. (a) EVR = 1. (b) EVR = 10. (c) EVR = 100. (d) EVR = 1000.



This leads again to an equation with the (L × L) autocorrelation matrix, but the right-hand side is different from the Wiener-Hopf equation:

$$\mathbf{R}_{xx}[n-1]\,\mathbf{a}[n] = \mathbf{r}^f[n] = \sum_{s=0}^{n} \mathbf{x}[s-1]\,x[s]. \tag{8.72}$$

Fig. 8.30. Linear forward prediction of order L.

The minimum value of the cost function is given by

$$\alpha^f[n] = r_0^f[n] - \mathbf{a}^T[n]\,\mathbf{r}^f[n], \tag{8.73}$$

where $r_0^f[n] = \sum_{s=0}^{n} x[s]^2$.

The important fact about this predictor is that the Levinson-Durbin algorithm can solve the least-square error minimum of (8.69) in a recursive fashion, without computing a matrix inverse. To update the predictor coefficients we need the same Kalman gain factor as in (8.66) for updating the filter coefficients, namely

$$\mathbf{a}_L[n+1] = \mathbf{a}_L[n] + \mathbf{k}_L[n]\,e_L^f[n+1].$$



We will see later how the linear prediction coefficients can be used to iteratively update the Kalman gain factor. In order to take advantage of the fact that the data vectors from one iteration to the next differ only in the first and last elements, we use an augmented-by-one version $\mathbf{k}_{L+1}[n]$ of the Kalman gain update equation (8.65), which is given by

$$\mathbf{k}_{L+1}[n+1] = \mathbf{R}_{xx,L+1}^{-1}[n+1]\,\mathbf{x}_{L+1}[n+1] \tag{8.74}$$

$$= \begin{bmatrix} r_{0L}^f[n+1] & \mathbf{r}_L^{fT}[n+1] \\ \mathbf{r}_L^f[n+1] & \mathbf{R}_{xx,L}[n] \end{bmatrix}^{-1} \begin{bmatrix} x[n+1] \\ \mathbf{x}_L[n] \end{bmatrix}. \tag{8.75}$$



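In order to compute the matrix inverse of $\mathbf{R}_{xx,L+1}[n]$ we use a well-known theorem on the inversion of block matrices, i.e.,

$$\mathbf{M}^{-1} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}^{-1} = \begin{bmatrix} -(\mathbf{B}\mathbf{D}^{-1}\mathbf{C} - \mathbf{A})^{-1} & (\mathbf{B}\mathbf{D}^{-1}\mathbf{C} - \mathbf{A})^{-1}\mathbf{B}\mathbf{D}^{-1} \\ \mathbf{D}^{-1}\mathbf{C}\,(\mathbf{B}\mathbf{D}^{-1}\mathbf{C} - \mathbf{A})^{-1} & \mathbf{D}^{-1} - (\mathbf{D}^{-1}\mathbf{C}\mathbf{B}\mathbf{D}^{-1})(\mathbf{B}\mathbf{D}^{-1}\mathbf{C} - \mathbf{A})^{-1} \end{bmatrix} \tag{8.76}$$

if D is nonsingular. We now make the following associations:

$$\mathbf{A} = r_{0L}^f[n+1] \qquad \mathbf{B} = \mathbf{r}_L^{fT}[n+1] \qquad \mathbf{C} = \mathbf{r}_L^f[n+1] \qquad \mathbf{D} = \mathbf{R}_{xx,L}[n];$$

we then get

$$\mathbf{D}^{-1}\mathbf{C} = \mathbf{R}_{xx,L}^{-1}[n]\,\mathbf{r}_L^f[n+1] = \mathbf{a}_L[n+1]$$

$$\mathbf{B}\mathbf{D}^{-1} = \mathbf{r}_L^{fT}[n+1]\,\mathbf{R}_{xx,L}^{-1}[n] = \mathbf{a}_L^T[n+1]$$

$$-(\mathbf{B}\mathbf{D}^{-1}\mathbf{C} - \mathbf{A}) = -\mathbf{r}_L^{fT}[n+1]\,\mathbf{R}_{xx,L}^{-1}[n]\,\mathbf{r}_L^f[n+1] + r_{0L}^f[n+1] = r_{0L}^f[n+1] - \mathbf{a}_L^T[n+1]\,\mathbf{r}_L^f[n+1] = \alpha_L^f[n+1].$$

Fig. 8.31. Linear backward prediction of order L.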

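We can now rewrite $\mathbf{R}_{xx,L+1}^{-1}[n+1]$ from (8.74) as

$$\mathbf{R}_{xx,L+1}^{-1}[n+1] = \begin{bmatrix} \dfrac{1}{\alpha_L^f[n+1]} & -\dfrac{\mathbf{a}_L^T[n+1]}{\alpha_L^f[n+1]} \\ -\dfrac{\mathbf{a}_L[n+1]}{\alpha_L^f[n+1]} & \mathbf{R}_{xx,L}^{-1}[n] + \dfrac{\mathbf{a}_L[n+1]\,\mathbf{a}_L^T[n+1]}{\alpha_L^f[n+1]} \end{bmatrix}.$$

After some rearrangements (8.74) can be written as

$$\mathbf{k}_{L+1}[n+1] = \begin{bmatrix} 0 \\ \mathbf{k}_L[n] \end{bmatrix} + \frac{\varepsilon_L^f[n+1]}{\alpha_L^f[n+1]} \begin{bmatrix} 1 \\ -\mathbf{a}_L[n+1] \end{bmatrix} = \begin{bmatrix} \mathbf{g}_L[n+1] \\ \gamma_L[n+1] \end{bmatrix}. \tag{8.77}$$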



Unfortunately, we do not have a closed recursion so far. For the iterative update of the Kalman gain vector we need, besides the forward prediction coefficients, also the coefficients of the one-step backward predictor, whose a posteriori error function is

$$\varepsilon^b[n] = x[n-L] - \hat{x}[n-L] = x[n-L] - \mathbf{b}^T[n]\,\mathbf{x}[n]; \tag{8.78}$$

again all vectors are of size (L × 1). The linear backward predictor is shown in Fig. 8.31.

The a priori error for the backward predictor is given by

$$e_L^b[n] = x[n-L] - \mathbf{b}^T[n-1]\,\mathbf{x}_L[n].$$



The iterative equation to compute the least-square coefficients for the backward predictor is equivalent to the forward case and is given by

$$\mathbf{R}_{xx}[n]\,\mathbf{b}[n] = \mathbf{r}^b[n] = \sum_{s=0}^{n} \mathbf{x}[s]\,x[s-L], \tag{8.79}$$

and the minimum value for the total squared error becomes

$$\alpha^b[n] = r_0^b[n] - \mathbf{b}^T[n]\,\mathbf{r}^b[n],$$

where $r_0^b[n] = \sum_{s=0}^{n} x[s-L]^2$. To update the backward predictor coefficients we again need the Kalman gain factor in (8.66), as for the updating of the filter coefficients, namely

$$\mathbf{b}_L[n+1] = \mathbf{b}_L[n] + \mathbf{k}_L[n+1]\,e_L^b[n+1].$$

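Now we can again find a Levinson-Durbin type of recursive equation for the extended Kalman gain vector, only this time using the backward prediction coefficients. It follows that

$$\mathbf{k}_{L+1}[n+1] = \mathbf{R}_{xx,L+1}^{-1}[n+1]\,\mathbf{x}_{L+1}[n+1] = \begin{bmatrix} \mathbf{R}_{xx,L}[n+1] & \mathbf{r}_L^b[n+1] \\ \mathbf{r}_L^{bT}[n+1] & r_{0L}^b[n+1] \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{x}_L[n+1] \\ x[n-L+1] \end{bmatrix}. \tag{8.80}$$

To solve the matrix inversion, we define as in (8.76) an (L + 1) × (L + 1) block matrix M, only this time the block A needs to be nonsingular, and it follows that

$$\mathbf{M}^{-1} = \begin{bmatrix} \mathbf{A}^{-1} - (\mathbf{A}^{-1}\mathbf{B}\mathbf{C}\mathbf{A}^{-1})(\mathbf{C}\mathbf{A}^{-1}\mathbf{B} - \mathbf{D})^{-1} & \mathbf{A}^{-1}\mathbf{B}\,(\mathbf{C}\mathbf{A}^{-1}\mathbf{B} - \mathbf{D})^{-1} \\ (\mathbf{C}\mathbf{A}^{-1}\mathbf{B} - \mathbf{D})^{-1}\mathbf{C}\mathbf{A}^{-1} & -(\mathbf{C}\mathbf{A}^{-1}\mathbf{B} - \mathbf{D})^{-1} \end{bmatrix}.$$

We now make the following associations:

$$\mathbf{A} = \mathbf{R}_{xx,L}[n+1] \qquad \mathbf{B} = \mathbf{r}_L^b[n+1] \qquad \mathbf{C} = \mathbf{r}_L^{bT}[n+1] \qquad \mathbf{D} = r_{0L}^b[n+1];$$

we then get the following intermediate results

$$\mathbf{A}^{-1}\mathbf{B} = \mathbf{R}_{xx,L}^{-1}[n+1]\,\mathbf{r}_L^b[n+1] = \mathbf{b}_L[n+1]$$

$$\mathbf{C}\mathbf{A}^{-1} = \mathbf{r}_L^{bT}[n+1]\,\mathbf{R}_{xx,L}^{-1}[n+1] = \mathbf{b}_L^T[n+1]$$

$$-(\mathbf{C}\mathbf{A}^{-1}\mathbf{B} - \mathbf{D}) = -\mathbf{b}_L^T[n+1]\,\mathbf{r}_L^b[n+1] + r_{0L}^b[n+1] = \alpha_L^b[n+1].$$

Using these intermediate results in (8.80) we get

$$\mathbf{R}_{xx,L+1}^{-1}[n+1] = \begin{bmatrix} \mathbf{R}_{xx,L}^{-1}[n+1] + \dfrac{\mathbf{b}_L[n+1]\,\mathbf{b}_L^T[n+1]}{\alpha_L^b[n+1]} & -\dfrac{\mathbf{b}_L[n+1]}{\alpha_L^b[n+1]} \\ -\dfrac{\mathbf{b}_L^T[n+1]}{\alpha_L^b[n+1]} & \dfrac{1}{\alpha_L^b[n+1]} \end{bmatrix}.$$

After some rearrangements (8.80) can now, using the backward prediction coefficients, be written as

$$\mathbf{k}_{L+1}[n+1] = \begin{bmatrix} \mathbf{k}_L[n+1] \\ 0 \end{bmatrix} + \frac{\varepsilon_L^b[n+1]}{\alpha_L^b[n+1]} \begin{bmatrix} -\mathbf{b}_L[n+1] \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{g}_L[n+1] \\ \gamma_L[n+1] \end{bmatrix}. \tag{8.81}$$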



The only iterative update equations missing so far are those for the minimum values of the total squared errors, which are given by

$$\alpha_L^f[n+1] = \alpha_L^f[n] + e_L^f[n+1]\,\varepsilon_L^f[n+1] \tag{8.82}$$

$$\alpha_L^b[n+1] = \alpha_L^b[n] + e_L^b[n+1]\,\varepsilon_L^b[n+1]. \tag{8.83}$$

We now have all iterative equations available to define the 

Algorithm 8.10: Fast Kalman RLS Algorithm 

The prewindowed fast Kalman RLS algorithm to adjust the L filter coefficients of an adaptive filter uses the following steps:

1) Initialize $\mathbf{x} = \mathbf{a} = \mathbf{b} = \mathbf{f} = \mathbf{k} = [0, 0, \ldots, 0]^T$ and $\alpha^f = \alpha^b = \delta$.

2) Accept a new pair of input samples {x[n+1], d[n+1]}.

3) Compute the following equations to update a, b, and k in sequential order:

$$e_L^f[n+1] = x[n+1] - \mathbf{a}^T[n]\,\mathbf{x}_L[n]$$

$$\mathbf{a}_L[n+1] = \mathbf{a}_L[n] + \mathbf{k}_L[n]\,e_L^f[n+1]$$

$$\varepsilon_L^f[n+1] = x[n+1] - \mathbf{a}^T[n+1]\,\mathbf{x}_L[n]$$

$$\alpha_L^f[n+1] = \alpha_L^f[n] + e_L^f[n+1]\,\varepsilon_L^f[n+1]$$

$$\mathbf{k}_{L+1}[n+1] = \begin{bmatrix} 0 \\ \mathbf{k}_L[n] \end{bmatrix} + \frac{\varepsilon_L^f[n+1]}{\alpha_L^f[n+1]} \begin{bmatrix} 1 \\ -\mathbf{a}_L[n+1] \end{bmatrix} = \begin{bmatrix} \mathbf{g}_L[n+1] \\ \gamma_L[n+1] \end{bmatrix}$$

$$e_L^b[n+1] = x[n+1-L] - \mathbf{b}^T[n]\,\mathbf{x}_L[n+1]$$

$$\mathbf{k}_L[n+1] = \frac{\mathbf{g}_L[n+1] + \gamma_L[n+1]\,\mathbf{b}[n]}{1 - \gamma_L[n+1]\,e_L^b[n+1]}$$

$$\mathbf{b}_L[n+1] = \mathbf{b}_L[n] + \mathbf{k}_L[n+1]\,e_L^b[n+1].$$

4) Shift x[n+1] into the reference signal vector $\mathbf{x}[n+1]$ and compute the following two equations in order to update the adaptive filter coefficients:

$$e_L[n+1] = d[n+1] - \mathbf{f}_L^T[n]\,\mathbf{x}_L[n+1]$$

$$\mathbf{f}_L[n+1] = \mathbf{f}_L[n] + \mathbf{k}_L[n+1]\,e_L[n+1].$$

Next continue with step 2.

Counting the computational effort we find that step 3 needs 2 divisions, 8L + 2 multiplications, and 7L + 2 add or subtract operations. The coefficient update in step 4 uses an additional 2L multiply and add/subtract operations, so that the total computational effort is 10L + 2 multiplications, 9L + 2 add/subtract operations, and 2 divisions.



8.6.3 The Fast a Posteriori Kalman RLS Algorithm 

A careful inspection of Algorithm 8.10 reveals that the original fast Kalman algorithm as introduced by Ljung et al. [242] is mainly based on the a priori error equations. The fast a posteriori error sequential technique (FAEST) introduced by Carayannis et al. [243] makes greater use of the a posteriori error. The algorithm exploits the iterative nature of the different parameters in the fast Kalman algorithm even further, which reduces the computational effort by an additional 2L multiplications. Otherwise, the original fast Kalman and the FAEST algorithms use mainly the same ideas, i.e., a Kalman gain extended in length by one, and the use of the forward and backward predictors a and b. We also introduce the forgetting factor ρ. The following listing shows the inner loop of the FAEST algorithm in MatLab:



    %********* FAEST Update of k, a, and b
    ef=xin - a'*x;                     % a priori forward prediction error
    ediva=ef/(rho*alphaf);             % a priori forward error/minimal error
    ke(1)=-ediva;                      % extended Kalman gain vector update
    ke(2:l+1)=k - ediva*a;             % split the l+1 length vector
    epsf=ef*psi;                       % a posteriori forward error
    a=a+epsf*k;                        % update forward coefficients
    k=ke(1:l) + ke(l+1).*b;            % Kalman gain vector update
    eb=-rho*alphab*ke(l+1);            % a priori backward error
    alphaf=rho*alphaf+ef*epsf;         % forward minimal error
    alpha=alpha+ke(l+1)*eb+ediva*ef;   % prediction crosspower
    psi=1.0/alpha;                     % psi makes it a 2 div algorithm
    epsb=eb*psi;                       % a posteriori backward error update
    alphab=rho*alphab+eb*epsb;         % minimum backward error
    b=b-k*epsb;                        % update backward prediction coefficients
    x=[xin;x(1:l-1)];                  % shift new value into filter taps
    %******** Time updating of the LS FIR filter
    e=din-f'*x;                        % error: reference - filter output
    eps=-e*psi;                        % a posteriori error of adaptive filter
    f=f+k*eps;                         % coefficient update



The total effort (not counting the exponential weighting with ρ) is 2 divisions, 7L + 8 multiplications, and 7L + 4 additions or subtractions.



8.7 Comparison of LMS and RLS Parameters 

Finally, Table 8.3 compares the algorithms we have introduced in this chapter. The table shows a comparison in terms of computational complexity for the basic stochastic gradient (SG) methods like the signed LMS (SLMS), normalized LMS (NLMS), or block LMS (BLMS) algorithm using an FFT. Transform-domain algorithms are listed next, but the effort does not include the power normalization, i.e., L normalizations in the transform domain. From the RLS algorithms we have discussed the (fast) Kalman algorithm and the FAEST algorithm. Lattice algorithms (not discussed) in general require a large number of division and square root computations, and it has been suggested to use the logarithmic number system (see Chap. 2, p. 41) in this case [244].

Table 8.3. Complexity comparison for LMS and RLS algorithms for a length-L adaptive filter. TDLMS without normalization. Add L multiplications, 2L add/subtract, and L divide operations if normalization is used in the TDLMS algorithms.

Algorithm   Implementation   Computational load
                             mult                       add/sub                  div
SG          LMS              2L                         2L                       -
            SLMS             L                          2L                       -
            NLMS             2L + 1                     2L + 2                   1
            BLMS (FFT)       10 log2(L) + 8             15 log2(L) + 30          -
SG TDLMS    Hadamard         2L                         4L - 2                   -
            Haar             2L                         2L + 2 log2(L)           -
            DCT              2L + (3L/2) log2(L) + L    2L + (3L/2) log2(L)      -
            DFT              2L + (3L/2) log2(L)        2L + (3L/2) log2(L)      -
            KLT              2L + L^2 + L               2L + 2L                  -
RLS         direct           2L^2 + 4L                  2L^2 + 2L - 2            2
            fast Kalman      10L + 2                    9L + 2                   2
            lattice          8L                         8L                       6L
            FAEST            7L + 8                     7L + 4                   2

The data in Table 8.3 are based on the discussion in Chap. 6 of the DCT and DFT and their implementation using fast DIF or DIT algorithms. For a DCT or DFT of length 8 and 16, more efficient (Winograd-type) algorithms have been developed that use even fewer operations. A length-8 DCT (see Fig. 6.18, p. 282), for instance, uses 12 multiplications, and a DCT transform-domain algorithm can then be implemented with 2 × 8 + 12 = 28 multiplications, which compares to the 7 × 8 + 8 = 64 multiplications of the FAEST algorithm. But this calculation does not take into account that a power normalization is mandatory for all TDLMS algorithms (otherwise there is no fast convergence compared with the standard LMS algorithm [232, 233]). The effort for the division may be larger than the multiplication effort. When the power normalization factor can be determined beforehand, it may be possible to implement the division with hardwired scaling operations. FAEST needs only 2 divisions, independent of the ADF length.
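The multiplication counts quoted in this chapter can be tabulated for a given filter length with a few lines of code. The Python sketch below is only a convenience for such comparisons; the function name `mults_per_sample` and the dictionary keys are assumptions, and the formulas are the per-sample counts stated in the text for the LMS algorithm, Algorithm 8.8, the fast Kalman algorithm, and FAEST.

```python
def mults_per_sample(L):
    """Multiplications per input sample for a length-L adaptive filter."""
    return {
        "LMS": 2 * L,
        "RLS (Algorithm 8.8)": (3 * L * L + 9 * L) // 2,  # (3L^2 + 9L)/2, always an integer
        "fast Kalman": 10 * L + 2,
        "FAEST": 7 * L + 8,
    }

if __name__ == "__main__":
    for L in (8, 16, 64):
        print(L, mults_per_sample(L))
```

For L = 8 this reproduces the 7 × 8 + 8 = 64 multiplications of the FAEST algorithm used in the comparison above; the quadratic growth of the matrix-based RLS recursion becomes dominant for longer filters.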

A comparison of the RLS and LMS adaptation speed was presented in Example 8.9 (p. 410), which shows that RLS-type algorithms adapt much faster than the LMS algorithm, but the LMS algorithm can be improved substantially with transform-domain algorithms like the DCT-LMS. Also, the error floor and the consistency of the error are, in general, better for the RLS algorithm than for the LMS or TDLMS algorithms. But none of the RLS-type algorithms can be implemented without division operations, which usually require a larger overall system bit width, at least a fractional number representation, or even a floating-point representation [244]. The LMS algorithm, on the other hand, can be implemented with only a few bits, as presented in Example 8.5 (p. 392).



Exercises 

8.1: Suppose the following signal is given
$x[n] = A\cos[2\pi n/T + \phi]$.

(a) Determine the power or variance $\sigma^2$.

(b) Determine the autocorrelation function $r_{xx}[\tau]$.

(c) What is the period of $r_{xx}[\tau]$?

8.2: Suppose the following signal is given
$x[n] = A\sin[2\pi n/T + \phi] + n[n]$,

where n[n] is white Gaussian noise with variance $\sigma_n^2$.

(a) Determine the power or variance $\sigma^2$ of the signal x[n].

(b) Determine the autocorrelation function $r_{xx}[\tau]$.

(c) What is the period of $r_{xx}[\tau]$?

8.3: Suppose the following two signals are given:
$x[n] = \cos[2\pi n/T_0] \qquad y[n] = \cos[2\pi n/T_1]$.

(a) Determine the cross-correlation function $r_{xy}[\tau]$.

(b) What is the condition for $T_0$ and $T_1$ that $r_{xy}[\tau] = 0$?

8.4: Suppose the following signal statistics have been determined:

R xx = H r dx = 5 Rdd{ 0] = 20 . 

Compute

(a) $\mathbf{R}_{xx}^{-1}$.

(b) The optimal Wiener filter weight. 

(c) The error for the optimal filter weight. 

(d) The eigenvalues and the eigenvalue ratio. 

8.5: Suppose the following signal statistics for a second-order system are given:

$$\mathbf{R}_{xx} = \begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix} \qquad \mathbf{r}_{dx} = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} \qquad R_{dd}[0] = \sigma_d^2.$$

The optimal filter should have the coefficients $f_0$ and $f_1$.

(a) Compute $\mathbf{R}_{xx}^{-1}$.

(b) Determine the optimal filter weight error as a function of $f_0$ and $f_1$.

(c) Determine $f_0$ and $f_1$ as a function of r and c.

(d) Assume now that $r_1 = 0$. What are the optimal filter coefficients $f_0$ and $f_1$?

8.6: Suppose the desired signal is given as:
$d[n] = \cos[2\pi n/T_0]$.

The reference signal x[n] that is applied to the adaptive filter input is given as
$x[n] = \sin[2\pi n/T_0] + 0.5\cos[2\pi n/T_1]$,

where $T_0 = 5$ and $T_1 = 3$. Compute for a second-order system:

(a) $\mathbf{R}_{xx}$, $\mathbf{r}_{dx}$, and $R_{dd}[0]$.

(b) The optimal Wiener filter weight.

(c) The error for the optimal filter weight.

(d) The eigenvalues and the eigenvalue ratio.

(e) Repeat (a)-(d) for a third-order system.

8.7: Suppose the desired signal is given as:
$d[n] = \cos[2\pi n/T_0] + n[n]$,

where n[n] is white Gaussian noise with variance 1. The reference signal x[n] that is applied to the adaptive filter input is given as

$x[n] = \sin[2\pi n/T_0]$,

where $T_0 = 5$. Compute for a second-order system:

(a) $\mathbf{R}_{xx}$, $\mathbf{r}_{dx}$, and $R_{dd}[0]$.

(b) The optimal Wiener filter weight.

(c) The error for the optimal filter weight.

(d) The eigenvalues and the eigenvalue ratio.

(e) Repeat (a)-(d) for a third-order system.

8 . 8 : Suppose the desired signal is given as: 
d[n] = cos[47rn/To] 

where ra[n] is a white Gaussian noise with variance 1. The reference signal :r[n], 
which is applied to the adaptive filter input, is given as 

:r[ra] = sin[27rn/To] — cos[47rn/To], 

with To = 5. Compute for a second-order system: 

(a) Rxx i y dx • and Ftdd[0]. 

(b) The optimal Wiener filter weight. 

(c) The error for the optimal filter weight. 

(d) The eigenvalues and the eigenvalue ratio. 

(e) Repeat (a)-(d) for a third-order system. 

8.9: Using the four FIR filters given in Sect. 8.3.1 (p. 381), use C or MatLab to compute
the autocorrelation function and the eigenvalue ratio, using the autocorrelation
of a (filtered) sequence of 10 000 white noise samples, for the following system
lengths (i.e., sizes of the autocorrelation matrix):

(a) L = 2.

(b) L = 4.

(c) L = 8.

(d) L = 16.

Hint: The MatLab functions randn, filter, xcorr, toeplitz, and eig are helpful.

8.10: Using an IIR filter with one pole 0 < p < 1, use C or MatLab to compute the
autocorrelation function and plot the eigenvalue ratio, using the autocorrelation of
a (filtered) sequence of 10 000 white noise samples, for the following system lengths
(i.e., sizes of the autocorrelation matrix):

(a) L = 2.

(b) L = 4.

(c) L = 8.

(d) L = 16.

(e) Compare the results from (a) to (d) with the theoretical value
$\mathrm{EVR} = \left((1 + p)/(1 - p)\right)^2$ of Markov-1 processes [230].

Hint: The MatLab functions randn, filter, xcorr, toeplitz, and eig are helpful.
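The workflow named in the hint (generate noise, filter it, estimate the autocorrelation, build the Toeplitz matrix, take its eigenvalues) can also be tried without MatLab. The following Python sketch is only an illustration for the one-pole case of Exercise 8.10. All function names (`ar1_noise`, `autocorr`, `toeplitz`, `jacobi_eigenvalues`, `evr`) are hypothetical, the eigenvalues are obtained with a plain cyclic Jacobi iteration instead of MatLab's `eig`, and the biased autocorrelation estimate is just one possible choice.

```python
import math
import random

def ar1_noise(p, n, seed=1):
    """Filter n white Gaussian samples with the one-pole IIR y[k] = x[k] + p*y[k-1]."""
    rng = random.Random(seed)
    y, out = 0.0, []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0) + p * y
        out.append(y)
    return out

def autocorr(x, lags):
    """Biased autocorrelation estimate r[0], ..., r[lags-1]."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n for k in range(lags)]

def toeplitz(r):
    """Symmetric Toeplitz matrix whose first row is r."""
    size = len(r)
    return [[r[abs(i - j)] for j in range(size)] for i in range(size)]

def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues (ascending) of a real symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes the (p, q) element
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):          # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def evr(r):
    """Eigenvalue ratio lambda_max / lambda_min of the Toeplitz matrix built from r."""
    ev = jacobi_eigenvalues(toeplitz(r))
    return ev[-1] / ev[0]

if __name__ == "__main__":
    p = 0.5
    x = ar1_noise(p, 10000)
    for size in (2, 4, 8, 16):
        print(size, evr(autocorr(x, size)))
    print("theoretical limit:", ((1 + p) / (1 - p)) ** 2)
```

For L = 2 the eigenvalues are $r[0] \pm |r[1]|$, so the ratio for an ideal Markov-1 sequence is $(1+p)/(1-p)$; larger L should move the estimate toward the squared value quoted in part (e).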

8.11: Using the FIR filter for EVR = 1000 given in Sect. 8.3.1 (p. 381), use C or
MatLab to compute the eigenvectors of the autocorrelation for L = 16. Compare
the eigenvectors with the DCT basis vectors.

8.12: Using the FIR filter for EVR = 1000 given in Sect. 8.3.1 (p. 381), use C or
MatLab to compute the eigenvalue ratios of the transformed power normalized
autocorrelation matrices from (8.45) on page 390 for L = 16 using the following
transforms:

(a) Identity transform (i.e., no transform).

(b) DCT.

(c) Hadamard.

(d) Haar.

(e) Karhunen-Loeve.

(f) Build a ranking of the transforms from (a)-(e).

8.13: Using the one-pole IIR filter from Exercise 8.10, use C or MatLab to compute
for 10 values of p in the range 0.5 to 0.95 the eigenvalue ratios of the transformed
power normalized autocorrelation matrices from (8.45) on page 390 for L = 16
using the following transforms:

(a) Identity transform (i.e., no transform).

(b) DCT.

(c) Hadamard.

(d) Haar.

(e) Karhunen-Loeve.

(f) Build a ranking of the transforms from (a)-(e).



8.14: Use C or MatLab to rebuild the power estimation for the nonstationary
signal shown in Fig. 8.15 (p. 385). For the power estimation use

(a) Equation (8.38) page 383.

(b) Equation (8.41) page 386 with $\beta = 0.5$.

(c) Equation (8.41) page 386 with $\beta = 0.9$.

8.15: Use C or MatLab to rebuild the simulation shown in Example 8.1 (p. 373)
for the following filter lengths:

(a) L = 2.

(b) L = 3.

(c) L = 4.

(d) Compute the exact Wiener solution for L = 3.

(e) Compute the exact Wiener solution for L = 4.

8.16: Use C or MatLab to rebuild the simulation shown in Example 8.3 (p. 380)
for the following filter lengths:

(a) L = 2.

(b) L = 3.

(c) L = 4.

8.17: Use C or MatLab to rebuild the simulation shown in Example 8.6 (p. 399)
for the following pipeline configurations:

(a) DLMS with 1 pipeline stage.

(b) DLMS with 3 pipeline stages.

(c) DLMS with 6 pipeline stages.

(d) DLMS with 8 pipeline stages.



Exercises Using MaxPlusII 

8.18: (a) Change the filter length of the adaptive filter in Example 8.5 (p. 392) to
three.

(b) Make a functional compilation (with the MaxPlusII compiler) of the HDL code
for the filter.

(c) Perform a functional simulation of the filter with the inputs d[n] and x[n].

(d) Compare the results with the simulation in Exercise 8.15b and d.

8.19: (a) Change the filter length of the adaptive filter in Example 8.5 (p. 392) to
four.

(b) Make a functional compilation (with the MaxPlusII compiler) of the HDL code
for the filter.

(c) Perform a functional simulation of the filter with the inputs d[n] and x[n].

(d) Compare the results with the simulation in Exercise 8.15c and e.

8.20: (a) Change the DLMS filter design from Example 8.6 (p. 399) to a pipeline of
e[n] only, i.e., DLMS with 1 pipeline stage.

(b) Make a functional compilation (with the MaxPlusII compiler) of the HDL
code for the filter.

(c) Perform a functional simulation of the filter with the inputs d[n] and x[n].

(d) Compare the results with the simulation in Exercise 8.17a.

(e) Determine the size in LCs and the Registered Performance of your D=1 design.

8.21: (a) Change the DLMS filter design from Example 8.6 (p. 399) to a pipeline of the f
update only, i.e., DLMS with 3 pipeline stages.

(b) Make a functional compilation (with the MaxPlusII compiler) of the HDL
code for the filter.

(c) Perform a functional simulation of the filter with the inputs d[n] and x[n].

(d) Compare the results with the simulation in Exercise 8.17b.

(e) Determine the size in LCs and the Registered Performance of your D=3 design.

8.22: (a) Change the DLMS filter design from Example 8.6 (p. 399) to a pipeline
with an optimal number of stages, i.e., DLMS with 8 pipeline stages, 3 for each
multiplier and one stage each for e[n] and y[n].

(b) Make a functional compilation (with the MaxPlusII compiler) of the HDL
code for the filter.

(c) Perform a functional simulation of the filter with the inputs d[n] and x[n].

(d) Compare the results with the simulation in Exercise 8.17d.

(e) Determine the size in LCs and the Registered Performance of your D=8 design.
A. Verilog Source Code



//*********************************************************
// IEEE STD 1364-1995 Verilog file: example.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
//`include "220model.v"      // Using predefined components

module example (clk, a, b, op1, sum, d);   // ----> Interface

  parameter WIDTH = 8;       // Bit width

  input clk;
  input  [WIDTH-1:0] a, b, op1;
  output [WIDTH-1:0] sum, d;

  wire [WIDTH-1:0] c;        // Auxiliary variables
  reg  [WIDTH-1:0] s;        // Infer FF with always
  wire [WIDTH-1:0] op2, op3;

  wire clkena, ADD, ena, aset, sclr, sset, aload, sload,
       aclr, ovf1, cin1;     // Auxiliary lpm signals

  // Default for add:
  assign cin1=0; assign aclr=0; assign ADD=1;

  assign ena=1; assign aclr=0; assign aset=0;
  assign sclr=0; assign sset=0; assign aload=0;
  assign sload=0; assign clkena=0;   // Default for FF

  assign op2 = b;   // Only one vector type in Verilog;
                    // no conversion int -> logic vector necessary

  // Note when using 220model.v ALL component's signals
  // must be defined, default values can only be used for
  // the parameters.

  lpm_add_sub add1           // ----> Component instantiation
  ( .result(op3), .dataa(op1), .datab(op2));   // Used ports
  // .cin(cin1), .cout(cr1), .add_sub(ADD), .clken(clkena),
  // .clock(clk), .overflow(ovl1), .aclr(aclr));   // Unused
  defparam add1.lpm_width = WIDTH;
  defparam add1.lpm_representation = "SIGNED";

  lpm_ff reg1
  ( .data(op3), .q(sum), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg1.lpm_width = WIDTH;

  assign c = a + b;   // ----> Continuous assignment statement

  always @(posedge clk)      // ----> Behavioral style
  begin : p1                 // Infer register
    s = c + s;               // Signal assignment statement
  end

  assign d = s;
endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: fun_text.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// A 32 bit function generator using accumulator and ROM
//`include "220model.v"

module fun_text (M, sin, acc, clk);   // ----> Interface

  parameter WIDTH = 32;      // Bit width

  input  [WIDTH-1:0] M;
  output [7:0] sin, acc;
  input clk;

  wire [WIDTH-1:0] s, acc32;
  wire [7:0] msbs;                    // Auxiliary vectors
  wire ADD, ena, aset, sclr, sset;    // Auxiliary signals
  wire aload, sload, aclr, ovf1, cin1, clkena;

  // Default for add:
  assign clkena=0; assign cin1=0; assign ADD=1;
  // Default for FF:
  assign ena=1; assign aclr=0; assign aset=0; assign sclr=0;
  assign sset=0; assign aload=0; assign sload=0;

  lpm_add_sub add_1          // Add M to acc32
  ( .result(s), .dataa(acc32), .datab(M));   // Used ports
  // .cout(cr1), .add_sub(ADD), .overflow(ovl1),   // Unused
  // .clock(clk), .cin(cin1), .clken(clkena), .aclr(aclr));
  defparam add_1.lpm_width = WIDTH;
  defparam add_1.lpm_representation = "UNSIGNED";

  lpm_ff reg_1               // Save accu
  ( .data(s), .q(acc32), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset),   // Unused ports
  // .sset(sset), .aload(aload), .sload(sload), .sclr(sclr));
  defparam reg_1.lpm_width = WIDTH;

  assign msbs = acc32[WIDTH-1:WIDTH-8];
  assign acc = msbs;

  lpm_rom rom1
  ( .q(sin), .inclock(clk), .outclock(clk),
    .address(msbs));         // Used ports
  // .memenab(ena));         // Unused port
  defparam rom1.lpm_width = 8;
  defparam rom1.lpm_widthad = 8;
  defparam rom1.lpm_file = "sine.mif";

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: add_1p.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
//`include "220model.v"

module add_1p (x, y, sum, clk);

  parameter WIDTH  = 15,     // Total bit width
            WIDTH1 = 7,      // Bit width of LSBs
            WIDTH2 = 8;      // Bit width of MSBs

  input  [WIDTH-1:0] x, y;   // Inputs
  output [WIDTH-1:0] sum;    // Result
  input clk;                 // Clock

  reg  [WIDTH1-1:0] l1, l2;      // LSBs of inputs
  wire [WIDTH1-1:0] q1, r1;      // LSBs of inputs
  reg  [WIDTH2-1:0] l3, l4;      // MSBs of input
  wire [WIDTH2-1:0] r2, q2, u2;  // MSBs of input
  reg  [WIDTH-1:0] s;            // Output register
  wire cr1, cq1;                 // LSBs carry signal
  wire [WIDTH2-1:0] h2;          // Auxiliary MSBs of input

  wire clkena, ADD, ena, aset, sclr;   // Auxiliary signals
  wire sset, aload, sload, aclr, ovf1, cin1;

  // Default for add:
  assign cin1=0; assign aclr=0; assign ADD=1;

  assign ena=1; assign aclr=0;           // Default for FF
  assign sclr=0; assign sset=0; assign aload=0;
  assign sload=0; assign clkena=0; assign aset=0;

  // Split in MSBs and LSBs and store in registers
  always @(posedge clk) begin
    // Split LSBs from input x,y
    l1[WIDTH1-1:0] <= x[WIDTH1-1:0];
    l2[WIDTH1-1:0] <= y[WIDTH1-1:0];
    // Split MSBs from input x,y
    l3[WIDTH2-1:0] <= x[WIDTH2-1+WIDTH1:WIDTH1];
    l4[WIDTH2-1:0] <= y[WIDTH2-1+WIDTH1:WIDTH1];
  end

  /************* First stage of the adder *****************/
  lpm_add_sub add_1          // Add LSBs of x and y
  ( .result(r1), .dataa(l1), .datab(l2), .cout(cr1));
                             // Used ports
  // .overflow(ovl1), .clken(clkena), .add_sub(ADD),
  // .cin(cin1), .clock(clk), .aclr(aclr));   // Unused ports
  defparam add_1.lpm_width = WIDTH1;
  defparam add_1.lpm_direction = "add";

  lpm_ff reg_1               // Save LSBs of x+y
  ( .data(r1), .q(q1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_1.lpm_width = WIDTH1;

  lpm_ff reg_2               // Save LSBs carry
  ( .data(cr1), .q(cq1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_2.lpm_width = 1;

  lpm_add_sub add_2          // Add MSBs of x and y
  ( .dataa(l3), .datab(l4), .result(r2));   // Used ports
  // .add_sub(ADD), .cout(cout1), .cin(cin1), .clken(clkena),
  // .overflow(ovl1), .clock(clk), .aclr(aclr));   // Unused
  defparam add_2.lpm_width = WIDTH2;
  defparam add_2.lpm_direction = "add";

  lpm_ff reg_3               // Save MSBs of x+y
  ( .data(r2), .q(q2), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_3.lpm_width = WIDTH2;

  /************** Second stage of the adder ****************/
  // One operand is zero
  assign h2 = {WIDTH2{1'b0}};

  lpm_add_sub add_3   // Add MSBs (x+y) and carry from LSBs
  ( .cin(cq1), .dataa(q2), .datab(h2), .result(u2));
                      // Used ports
  // .cout(cout1), .overflow(ovl1), .clken(clkena),   // Unused
  // .add_sub(ADD), .clock(clk), .aclr(aclr));        // ports
  defparam add_3.lpm_width = WIDTH2;
  defparam add_3.lpm_direction = "add";

  always @(posedge clk) begin   // Build a single registered
    s = {u2[WIDTH2-1:0], q1[WIDTH1-1:0]};   // output word
  end                           // of WIDTH=WIDTH1+WIDTH2

  assign sum = s;   // Connect s to output pins

endmodule
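The idea behind the pipelined adder listing (add the LSBs and MSBs separately, then fold the registered LSB carry into the MSB sum one stage later) can be checked with a small software model. The following Python sketch is an assumption-based illustration, not a cycle-accurate simulation: the function name `split_add` and its parameters are hypothetical, and the registers are modeled simply as intermediate variables.

```python
def split_add(x, y, width_lsb=7, width_msb=8):
    """Add x and y in two partial sums, applying the LSB carry one stage later."""
    mask_l = (1 << width_lsb) - 1
    mask_m = (1 << width_msb) - 1
    # Input registers: split both operands into LSBs and MSBs
    l1, l2 = x & mask_l, y & mask_l
    l3, l4 = (x >> width_lsb) & mask_m, (y >> width_lsb) & mask_m
    # First stage: two independent short additions
    r1 = l1 + l2
    q1, cq1 = r1 & mask_l, r1 >> width_lsb   # LSB sum and its carry
    q2 = (l3 + l4) & mask_m                  # MSB sum without carry
    # Second stage: add the delayed carry to the MSB sum
    u2 = (q2 + cq1) & mask_m
    # Output register: concatenate MSBs and LSBs
    return (u2 << width_lsb) | q1

if __name__ == "__main__":
    print(split_add(100, 100))   # same as (100 + 100) modulo 2**15
```

Because each partial adder is shorter than the full word, the carry chain per stage is shorter, which is the motivation for the extra pipeline register.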

//*********************************************************
// IEEE STD 1364-1995 Verilog file: add_2p.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// 22-bit adder with two pipeline stages
// uses four components: csa7.v; csa7cin.v;
//                       add_ff8.v; add_ff8cin.v
//`include "220model.v"
//`include "csa7.v"
//`include "csa7cin.v"
//`include "add_ff8.v"
//`include "add_ff8cin.v"

module add_2p (x, y, sum, clk);

  parameter WIDTH   = 22,    // Total bit width
            WIDTH1  = 7,     // Bit width of LSBs
            WIDTH2  = 7,     // Bit width of middle
            WIDTH12 = 14,    // Sum WIDTH1+WIDTH2
            WIDTH3  = 8;     // Bit width of MSBs

  input  [WIDTH-1:0] x, y;   // Inputs
  output [WIDTH-1:0] sum;    // Result
  input clk;                 // Clock

  reg  [WIDTH1-1:0] l1, l2;          // LSBs of inputs
  wire [WIDTH1-1:0] q1, v1, s1;      // LSBs of inputs
  reg  [WIDTH2-1:0] l3, l4;          // Middle bits
  wire [WIDTH2-1:0] q2, h2, v2, s2;  // Middle bits
  reg  [WIDTH3-1:0] l5, l6;          // MSBs of input
  wire [WIDTH3-1:0] q3, h3, v3, s3;  // MSBs of input
  wire [WIDTH-1:0] s;                // Output register
  wire cq1, cq2, cv2;                // Carry signals
  wire ena, aset, sclr, sset, aload, sload, aclr;
                                     // Auxiliary FF signals
  assign ena=1; assign aclr=0; assign aset=0;
  assign sclr=0; assign sset=0; assign aload=0;
  assign sload=0;                    // Default for FF

  // Split in MSBs and LSBs and store in registers
  always @(posedge clk) begin
    // Split LSBs from input x,y
    l1[WIDTH1-1:0] <= x[WIDTH1-1:0];
    l2[WIDTH1-1:0] <= y[WIDTH1-1:0];
    // Split middle bits from input x,y
    l3[WIDTH2-1:0] <= x[WIDTH2-1+WIDTH1:WIDTH1];
    l4[WIDTH2-1:0] <= y[WIDTH2-1+WIDTH1:WIDTH1];
    // Split MSBs from input x,y
    l5[WIDTH3-1:0] <= x[WIDTH3-1+WIDTH12:WIDTH12];
    l6[WIDTH3-1:0] <= y[WIDTH3-1+WIDTH12:WIDTH12];
  end

  //************** First stage of the adder ****************
  csa7 add_1                 // Add LSBs of x and y
  ( .a(l1), .b(l2), .clock(clk), .s(q1), .c(cq1));

  csa7 add_2                 // Add LSBs of x and y
  ( .a(l3), .b(l4), .clock(clk), .s(q2), .c(cq2));

  add_ff8 add_3              // Add MSBs of x and y
  ( .a(l5), .b(l6), .clock(clk), .s(q3));

  //************* Second stage of the adder *****************
  // Two operands are zero
  assign h2 = {WIDTH2{1'b0}};
  assign h3 = {WIDTH3{1'b0}};

  lpm_ff reg_1               // Save q1
  ( .data(q1), .q(v1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_1.lpm_width = WIDTH1;

  // Add result of middle bits (x+y) and carry from LSBs
  csa7cin add_4
  ( .a(q2), .b(h2), .cin(cq1), .clock(clk), .s(v2), .c(cv2));

  // Add result of MSBs bits (x+y) and carry from middle
  add_ff8cin add_5
  ( .a(q3), .b(h3), .cin(cq2), .clock(clk), .s(v3));

  //************* Third stage of the adder ******************
  lpm_ff reg_2               // Save v1
  ( .data(v1), .q(s1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_2.lpm_width = WIDTH1;

  lpm_ff reg_3               // Save v2
  ( .data(v2), .q(s2), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_3.lpm_width = WIDTH1;

  // Add result of MSBs bits (x+y) and 2. carry from middle
  add_ff8cin add_6
  ( .a(v3), .b(h3), .cin(cv2), .clock(clk), .s(s3));

  // Build a single output word of WIDTH=WIDTH1+WIDTH2+WIDTH3
  assign s = {s3[WIDTH3-1:0], s2[WIDTH2-1:0], s1[WIDTH1-1:0]};

  assign sum = s;   // Connect s to output pins

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: add_3p.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// 29-bit adder with three pipeline stages
// uses four components: csa7.v; csa7cin.v;
//                       add_ff8.v; add_ff8cin.v
//`include "220model.v"
//`include "csa7.v"
//`include "csa7cin.v"
//`include "add_ff8.v"
//`include "add_ff8cin.v"

module add_3p (x, y, sum, clk);

  parameter WIDTH    = 29,   // Total bit width
            WIDTH0   = 7,    // Bit width of LSBs
            WIDTH1   = 7,    // Bit width of 2. LSBs
            WIDTH01  = 14,   // Sum WIDTH0+WIDTH1
            WIDTH2   = 7,    // Bit width of 2. MSBs
            WIDTH012 = 21,   // Sum WIDTH0+WIDTH1+WIDTH2
            WIDTH3   = 8;    // Bit width of MSBs

  input  [WIDTH-1:0] x, y;   // Inputs
  output [WIDTH-1:0] sum;    // Result
  input clk;                 // Clock

  reg  [WIDTH0-1:0] l0, l1;              // LSBs of inputs
  wire [WIDTH0-1:0] q0, v0, r0, s0;      // LSBs of inputs
  reg  [WIDTH1-1:0] l2, l3;              // 2. LSBs of input
  wire [WIDTH1-1:0] q1, v1, r1, s1;      // 2. LSBs of input
  reg  [WIDTH2-1:0] l4, l5;              // 2. MSBs bits
  wire [WIDTH2-1:0] q2, v2, r2, s2, h7;  // 2. MSBs bits
  reg  [WIDTH3-1:0] l6, l7;              // MSBs of input
  wire [WIDTH3-1:0] q3, v3, r3, s3, h8;  // MSBs of input
  wire [WIDTH-1:0] s;                    // Output register
  wire cq0, cq1, cq2, cv1, cv2, cr2;     // Carry signals
  wire ena, aset, sclr, sset, aload, sload, aclr;
                                         // Auxiliary FF signals
  assign ena=1; assign aclr=0; assign aset=0;
  assign sclr=0; assign sset=0; assign aload=0;
  assign sload=0;                        // Default for FF

  // Split in MSBs and LSBs and store in registers
  always @(posedge clk) begin
    // Split LSBs from input x,y
    l0[WIDTH0-1:0] <= x[WIDTH0-1:0];
    l1[WIDTH0-1:0] <= y[WIDTH0-1:0];
    // Split 2. LSBs from input x,y
    l2[WIDTH1-1:0] <= x[WIDTH1-1+WIDTH0:WIDTH0];
    l3[WIDTH1-1:0] <= y[WIDTH1-1+WIDTH0:WIDTH0];
    // Split 2. MSBs from input x,y
    l4[WIDTH2-1:0] <= x[WIDTH2-1+WIDTH01:WIDTH01];
    l5[WIDTH2-1:0] <= y[WIDTH2-1+WIDTH01:WIDTH01];
    // Split MSBs from input x,y
    l6[WIDTH3-1:0] <= x[WIDTH3-1+WIDTH012:WIDTH012];
    l7[WIDTH3-1:0] <= y[WIDTH3-1+WIDTH012:WIDTH012];
  end

  //************* First stage of the adder *****************
  csa7 add_0                 // Add LSBs of x and y
  ( .a(l0), .b(l1), .clock(clk), .s(q0), .c(cq0));

  csa7 add_1                 // Add 2. LSBs of x and y
  ( .a(l2), .b(l3), .clock(clk), .s(q1), .c(cq1));

  csa7 add_2                 // Add 2. MSBs of x and y
  ( .a(l4), .b(l5), .clock(clk), .s(q2), .c(cq2));

  add_ff8 add_3              // Add MSBs of x and y
  ( .a(l6), .b(l7), .clock(clk), .s(q3));

  //************* Second stage of the adder *****************
  // Two operands are zero
  assign h7 = {WIDTH2{1'b0}};
  assign h8 = {WIDTH3{1'b0}};

  lpm_ff reg_1               // Save q0
  ( .data(q0), .q(v0), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_1.lpm_width = WIDTH0;

  // Add result of 2. LSBs (x+y) and carry from LSBs
  csa7cin add_4
  ( .a(q1), .b(h7), .cin(cq0), .clock(clk), .s(v1), .c(cv1));

  // Add result of 2. MSBs (x+y) and carry from 2. LSBs
  csa7cin add_5
  ( .a(q2), .b(h7), .cin(cq1), .clock(clk), .s(v2), .c(cv2));

  // Add result of MSBs (x+y) and carry from 2. MSBs
  add_ff8cin add_6
  ( .a(q3), .b(h8), .cin(cq2), .clock(clk), .s(v3));

  //************** Third stage of the adder *****************
  lpm_ff reg_2               // Save v0
  ( .data(v0), .q(r0), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_2.lpm_width = WIDTH0;

  lpm_ff reg_3               // Save v1
  ( .data(v1), .q(r1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_3.lpm_width = WIDTH1;

  // Add result of 2. MSBs (x+y) and carry from 2. LSBs
  csa7cin add_7
  ( .a(v2), .b(h7), .cin(cv1), .clock(clk), .s(r2), .c(cr2));

  // Add result of MSBs (x+y) and carry from 2. MSBs
  add_ff8cin add_8
  ( .a(v3), .b(h8), .cin(cv2), .clock(clk), .s(r3));

  //************ Fourth stage of the adder ******************
  lpm_ff reg_4               // Save r0
  ( .data(r0), .q(s0), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_4.lpm_width = WIDTH0;

  lpm_ff reg_5               // Save r1
  ( .data(r1), .q(s1), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_5.lpm_width = WIDTH1;

  lpm_ff reg_6               // Save r2
  ( .data(r2), .q(s2), .clock(clk));   // Used ports
  // .enable(ena), .aclr(aclr), .aset(aset), .sclr(sclr),
  // .sset(sset), .aload(aload), .sload(sload));   // Unused
  defparam reg_6.lpm_width = WIDTH2;

  // Add result of MSBs (x+y) and carry from 2. MSBs
  add_ff8cin add_9
  ( .a(r3), .b(h8), .cin(cr2), .clock(clk), .s(s3));

  // Build a single output word of
  // WIDTH = WIDTH0 + WIDTH1 + WIDTH2 + WIDTH3
  assign s = {s3[WIDTH3-1:0], s2[WIDTH2-1:0],
              s1[WIDTH1-1:0], s0[WIDTH0-1:0]};

  assign sum = s;   // Connect s to output pins

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: mul_ser.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module mul_ser (clk, x, a, y);   // ----> Interface

  input clk;
  input  [7:0] x, a;
  output [15:0] y;
  reg    [15:0] y;

  always @(posedge clk)   //-> Multiplier in behavioral style
  begin : States
    parameter s0=0, s1=1, s2=2;
    reg [2:0] count;
    reg [1:0] state;
    reg [15:0] p, t;          // Double bit width
    reg [7:0] a_reg;
    case (state)
      s0 : begin              // Initialization step
        a_reg <= a;
        state <= s1;
        count = 0;
        p <= 0;               // Product register reset
        t <= {{8{x[7]}},x};   // Set temporary shift register
      end                     // to x
      s1 : begin              // Processing step
        if (count == 7)       // Multiplication ready
          state <= s2;
        else                  // Note that MaxPlusII does not
        begin                 // allow variable bit selects,
          if (a_reg[0] == 1)  // see (LRM Sec. 4.2.1)
            p <= p + t;       // Add 2^k
          a_reg <= a_reg >> 1;   // Use LSB for the bit select
          t <= t << 1;
          count = count + 1;
          state <= s1;
        end
      end
      s2 : begin              // Output of result to y and
        y <= p;               // start next multiplication
        state <= s0;
      end
    endcase
  end

endmodule
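The shift-and-add principle used by the serial multiplier (inspect the LSB of the multiplier, conditionally add the shifted multiplicand, then shift both) can be illustrated in software. The following Python sketch is a hedged illustration only: the function name `shift_add_multiply` and the `bits` parameter are hypothetical, the model is not cycle-accurate, and it assumes a non-negative multiplier `a`.

```python
def shift_add_multiply(x, a, bits=8):
    """Multiply x by a (0 <= a < 2**bits) using only shifts and additions."""
    p = 0      # product register
    t = x      # multiplicand shifted left by k positions, i.e. x * 2**k
    for _ in range(bits):
        if a & 1:      # LSB of the multiplier selects the partial product
            p += t
        a >>= 1        # move the next multiplier bit into the LSB position
        t <<= 1
    return p

if __name__ == "__main__":
    print(shift_add_multiply(-3, 5))   # same value as -3 * 5
```

Shifting the multiplier instead of indexing it with a counter mirrors the comment in the listing about avoiding variable bit selects.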

//*********************************************************
// IEEE STD 1364-1995 Verilog file: div_res.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// Restoring Division
// Bit width:  WN          WD            WN            WD
//         Numerator / Denominator = Quotient and Remainder
// OR:     Numerator = Quotient * Denominator + Remainder

module div_res (clk, n_in, d_in, r_out, q_out);

  input clk;
  input  [7:0] n_in;
  input  [5:0] d_in;
  output [5:0] r_out;
  reg    [5:0] r_out;
  output [7:0] q_out;
  reg    [7:0] q_out;

  always @(posedge clk)   //-> Divider in behavioral style
  begin : States
    parameter s0=0, s1=1, s2=2, s3=3;
    reg [3:0] count;
    reg [1:0] state;
    reg [13:0] r, d;          // Double bit width
    reg [7:0] q;
    case (state)
      s0 : begin              // Initialization step
        state <= s1;
        count = 0;
        q <= 0;               // Reset quotient register
        d <= d_in << 7;       // Load aligned denominator
        r <= {6'B0, n_in};    // Remainder = numerator
      end
      s1 : begin              // Processing step
        r <= r - d;           // Subtract denominator
        state <= s2;
      end
      s2 : begin              // Restoring step
        if (r[13] == 1) begin // Check r < 0
          r <= r + d;         // Restore previous remainder
          q <= q << 1;        // LSB = 0 and SLL
        end
        else
          q <= (q << 1) + 1;  // LSB = 1 and SLL
        count = count + 1;
        d <= d >> 1;
        if (count == 8)       // Division ready ?
          state <= s3;
        else
          state <= s1;
      end
      s3 : begin              // Output of result
        q_out <= q[7:0];
        r_out <= r[5:0];
        state <= s0;          // Start next division
      end
    endcase
  end

endmodule
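The restoring division steps of the listing (trial subtraction of the aligned denominator, restore on a negative remainder, shift a 0 or 1 into the quotient) can be cross-checked against integer division in software. The following Python sketch is an illustration under stated assumptions: the function name `restoring_divide` and its parameters are hypothetical, the state machine is collapsed into a loop, and the 14-bit register width is modeled with a mask.

```python
def restoring_divide(n, d, wn=8, wd=6):
    """Return (quotient, remainder) of n / d using restoring division."""
    if d == 0:
        raise ZeroDivisionError("denominator must be nonzero")
    width = wn + wd                   # register width of r and d (14 bits here)
    mask = (1 << width) - 1
    sign = 1 << (width - 1)           # MSB acts as sign bit in two's complement
    r = n & mask                      # remainder starts as the numerator
    dd = (d << (wn - 1)) & mask       # aligned denominator
    q = 0
    for _ in range(wn):
        r = (r - dd) & mask           # trial subtraction
        if r & sign:                  # result negative?
            r = (r + dd) & mask       # restore previous remainder
            q = q << 1                # quotient bit 0
        else:
            q = (q << 1) | 1          # quotient bit 1
        dd >>= 1
    return q & ((1 << wn) - 1), r & ((1 << wd) - 1)

if __name__ == "__main__":
    print(restoring_divide(234, 50))  # 234 = 4 * 50 + 34
```

Each quotient bit costs a subtract cycle and a restore cycle in the listing, which is why the loop body contains both operations.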

//*********************************************************
// IEEE STD 1364-1995 Verilog file: div_aegp.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// Convergence division after
//             Anderson, Earle, Goldschmidt, and Powers
// Bit width:  WN          WD            WN            WD
//         Numerator / Denominator = Quotient and Remainder
// OR:     Numerator = Quotient * Denominator + Remainder

module div_aegp (clk, n_in, d_in, q_out);

  input clk;
  input  [8:0] n_in;
  input  [8:0] d_in;
  output [8:0] q_out;
  reg    [8:0] q_out;

  always @(posedge clk)   //-> Divider in behavioral style
  begin : States
    parameter s0=0, s1=1, s2=2;
    reg [1:0] count;
    reg [1:0] state;
    reg [9:0] x, t, f;        // one guard bit
    case (state)
      s0 : begin              // Initialization step
        state <= s1;
        count = 0;
        t <= {1'b0, d_in};    // Load denominator
        x <= {1'b0, n_in};    // Load numerator
      end
      s1 : begin              // Processing step
        f = 512 - t;          // TWO - t
        x <= (x * f) >> 8;    // Fractional f
        t <= (t * f) >> 8;    // Scale by 256
        count = count + 1;
        if (count == 2)       // Division ready ?
          state <= s2;
        else
          state <= s1;
      end
      s2 : begin              // Output of result
        q_out <= x[8:0];
        state <= s0;          // Start next division
      end
    endcase
  end

endmodule
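The convergence division in the listing multiplies numerator and denominator by the same factor f = 2 - t in every step, so that the denominator approaches 1 and the numerator approaches the quotient. The following Python sketch illustrates this with the fixed-point scaling suggested by the comments (256 represents 1.0, 512 represents 2.0). It rests on assumptions: the name `convergence_divide` and its parameters are hypothetical, products are kept at full precision (how wide the product is in hardware depends on the tool), and the denominator is assumed to lie between 0 and 2.0.

```python
def convergence_divide(n, d, iterations=2, frac_bits=8):
    """Approximate n / d in fixed point, where 1 << frac_bits represents 1.0."""
    one = 1 << frac_bits
    x, t = n, d                    # x converges to the quotient, t to 1.0
    for _ in range(iterations):
        f = 2 * one - t            # f = 2 - t
        x = (x * f) >> frac_bits   # rescale after the multiplication
        t = (t * f) >> frac_bits
    return x

if __name__ == "__main__":
    # 1.0 / 1.5 with 256 as 1.0; result is again scaled by 256
    print(convergence_divide(256, 384) / 256.0)
    print(convergence_divide(256, 384, iterations=4) / 256.0)
```

With two iterations the result is still coarse; additional iterations improve it until the truncation of the fixed-point products limits the accuracy.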

//*********************************************************
// IEEE STD 1364-1995 Verilog file: cordic.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module cordic (clk, x_in, y_in, r, phi, eps);

  parameter W = 7;           // Bit width - 1

  input clk;
  input  [W:0] x_in, y_in;
  output [W:0] r, phi, eps;
  reg    [W:0] r, phi, eps;

  // There is no bit access in 2D array types
  // in Verilog, therefore use single vectors
  reg [W:0] x0, y0, z0;
  reg [W:0] x1, y1, z1;
  reg [W:0] x2, y2, z2;
  reg [W:0] x3, y3, z3;

  always @(posedge clk) begin   // ----> Infer register
    if (x_in > 0)            // Test for x_in < 0 rotate
    begin                    // 0, +90, or -90 degrees
      x0 <= x_in;            // Input in register 0
      y0 <= y_in;
      z0 <= 0;
    end
    else if (y_in > 0)
    begin
      x0 <= y_in;
      y0 <= - x_in;
      z0 <= 90;
    end
    else
    begin
      x0 <= - y_in;
      y0 <= x_in;
      z0 <= -90;
    end

    if (y0 > 0)              // Rotate 45 degrees
    begin
      x1 <= x0 + y0;
      y1 <= y0 - x0;
      z1 <= z0 + 45;
    end
    else
    begin
      x1 <= x0 - y0;
      y1 <= y0 + x0;
      z1 <= z0 - 45;
    end

    if (y1 > 0)              // Rotate 26 degrees
    begin
      x2 <= x1 + {y1[W],y1[W:1]};   // i.e. x1 + y1/2
      y2 <= y1 - {x1[W],x1[W:1]};   // i.e. y1 - x1/2
      z2 <= z1 + 26;
    end
    else
    begin
      x2 <= x1 - {y1[W],y1[W:1]};   // i.e. x1 - y1/2
      y2 <= y1 + {x1[W],x1[W:1]};   // i.e. y1 + x1/2
      z2 <= z1 - 26;
    end

    if (y2 > 0)              // Rotate 14 degrees
    begin
      x3 <= x2 + {y2[W],y2[W],y2[W:2]};   // i.e. x2 + y2/4
      y3 <= y2 - {x2[W],x2[W],x2[W:2]};   // i.e. y2 - x2/4
      z3 <= z2 + 14;
    end
    else
    begin
      x3 <= x2 - {y2[W],y2[W],y2[W:2]};   // i.e. x2 - y2/4
      y3 <= y2 + {x2[W],x2[W],x2[W:2]};   // i.e. y2 + x2/4
      z3 <= z2 - 14;
    end

    r <= x3;
    phi <= z3;
    eps <= y3;
  end

endmodule
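The data flow of the CORDIC listing (a coarse pre-rotation by 0 or +/-90 degrees, followed by micro-rotations of 45, 26, and 14 degrees that drive y toward zero) can be followed in a short software model. The following Python sketch is an illustration under assumptions: the function name `cordic_vectoring` is hypothetical, values are treated as signed Python integers, the pipeline registers are collapsed into one pass, and the 8-bit wraparound of the hardware is not modeled.

```python
def cordic_vectoring(x_in, y_in):
    """Return (r, phi, eps): scaled radius, angle in degrees, residual y."""
    # Pre-rotation by 0, +90, or -90 degrees
    if x_in > 0:
        x, y, z = x_in, y_in, 0
    elif y_in > 0:
        x, y, z = y_in, -x_in, 90
    else:
        x, y, z = -y_in, x_in, -90
    # Micro-rotations; >> is an arithmetic shift, like the sign
    # extension {v[W], v[W:1]} in the listing
    for shift, angle in ((0, 45), (1, 26), (2, 14)):
        if y > 0:
            x, y, z = x + (y >> shift), y - (x >> shift), z + angle
        else:
            x, y, z = x - (y >> shift), y + (x >> shift), z - angle
    return x, z, y

if __name__ == "__main__":
    print(cordic_vectoring(-41, 55))
```

The returned radius includes the CORDIC gain of the three micro-rotations (roughly 1.63), so it is larger than the Euclidean length of the input vector.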

//*********************************************************
// IEEE STD 1364-1995 Verilog file: fir_gen.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// This is a generic FIR filter generator
// It uses W1 bit data/coefficients bits
module fir_gen (clk, Load_x, x_in, c_in, y_out);

  parameter W1 = 9,          // Input bit width
            W2 = 18,         // Multiplier bit width 2*W1
            W3 = 19,         // Adder width = W2+log2(L)-1
            W4 = 11,         // Output bit width
            L  = 4,          // Filter length
            Mpipe = 3;       // Pipeline steps of multiplier

  input clk, Load_x;             // std_logic
  input  [W1-1:0] x_in, c_in;    // Inputs
  output [W3-1:0] y_out;         // Results

  reg  [W1-1:0] x;
  wire [W3-1:0] y;

  // 2D array types i.e. memories not supported by MaxPlusII
  // in Verilog, use therefore single vectors
  reg  [W1-1:0] c0, c1, c2, c3;  // Coefficient array
  wire [W2-1:0] p0, p1, p2, p3;  // Product array
  reg  [W3-1:0] a0, a1, a2, a3;  // Adder array

  wire [W2-1:0] sum;             // Auxiliary signals
  wire clken, aclr;

  assign sum=0; assign aclr=0;   // Default for mult
  assign clken=0;

  // ----> Load Data or Coefficient
  always @(posedge clk)
  begin: Load
    if (!Load_x) begin
      c3 <= c_in;            // Store coefficient in register
      c2 <= c3;              // Coefficients shift one
      c1 <= c2;
      c0 <= c1;
    end
    else begin
      x <= x_in;             // Get one data sample at a time
    end
  end

  // ----> Compute sum-of-products
  always @(posedge clk)
  begin: SOP
    // Compute the transposed filter additions
    a0 <= {p0[W2-1], p0} + a1;
    a1 <= {p1[W2-1], p1} + a2;
    a2 <= {p2[W2-1], p2} + a3;
    a3 <= {p3[W2-1], p3};    // First TAP has only a register
  end

  assign y = a0;

  // Instantiate L pipelined multiplier
  lpm_mult mul_0             // Multiply x*c0 = p0
  ( .clock(clk), .dataa(x), .datab(c0), .result(p0));
  // .sum(sum), .clken(clken), .aclr(aclr));   // Unused ports
  defparam mul_0.lpm_widtha = W1;
  defparam mul_0.lpm_widthb = W1;
  defparam mul_0.lpm_widthp = W2;
  defparam mul_0.lpm_widths = W2;
  defparam mul_0.lpm_pipeline = Mpipe;
  defparam mul_0.lpm_representation = "SIGNED";

  lpm_mult mul_1             // Multiply x*c1 = p1
  ( .clock(clk), .dataa(x), .datab(c1), .result(p1));
  // .sum(sum), .clken(clken), .aclr(aclr));   // Unused ports
  defparam mul_1.lpm_widtha = W1;
  defparam mul_1.lpm_widthb = W1;
  defparam mul_1.lpm_widthp = W2;
  defparam mul_1.lpm_widths = W2;
  defparam mul_1.lpm_pipeline = Mpipe;
  defparam mul_1.lpm_representation = "SIGNED";

  lpm_mult mul_2             // Multiply x*c2 = p2
  ( .clock(clk), .dataa(x), .datab(c2), .result(p2));
  // .sum(sum), .clken(clken), .aclr(aclr));   // Unused ports
  defparam mul_2.lpm_widtha = W1;
  defparam mul_2.lpm_widthb = W1;
  defparam mul_2.lpm_widthp = W2;
  defparam mul_2.lpm_widths = W2;
  defparam mul_2.lpm_pipeline = Mpipe;
  defparam mul_2.lpm_representation = "SIGNED";

  lpm_mult mul_3             // Multiply x*c3 = p3
  ( .clock(clk), .dataa(x), .datab(c3), .result(p3));
  // .sum(sum), .clken(clken), .aclr(aclr));   // Unused ports
  defparam mul_3.lpm_widtha = W1;
  defparam mul_3.lpm_widthb = W1;
  defparam mul_3.lpm_widthp = W2;
  defparam mul_3.lpm_widths = W2;
  defparam mul_3.lpm_pipeline = Mpipe;
  defparam mul_3.lpm_representation = "SIGNED";

  assign y_out = y[W3-1:W3-W4];

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: fir_srg.v 
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org 

//*********************************************************
module fir_srg (clk, x, y); // ---> Interface

input        clk;
input  [7:0] x;
output [7:0] y;
reg    [7:0] y;

// Tapped delay line array of bytes
reg [7:0] tap0, tap1, tap2, tap3;
// For bit access use single vectors in Verilog

always @(posedge clk) // ---> Behavioral Style
begin : p1
  // Compute output y with the filter coefficients weight.
  // The coefficients are [-1 3.75 3.75 -1].
  // Multiplication and division for Altera MaxPlusII can
  // be done in Verilog with sign extensions and shifts!
  y <= (tap1<<1) + tap1 + {tap1[7],tap1[7:1]}
     + {tap1[7],tap1[7],tap1[7:2]} + (tap2<<1) + tap2
     + {tap2[7],tap2[7:1]}
     + {tap2[7],tap2[7],tap2[7:2]} - tap3 - tap0;

  tap3 <= tap2; // Tapped delay line: shift one
  tap2 <= tap1;
  tap1 <= tap0;
  tap0 <= x;    // Input in register 0
end

endmodule
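
The comment in fir_srg states that multiplication by 3.75 can be built from sign extensions and shifts. The following is a minimal simulation sketch of that one idea, not part of the original listing: the module name mul3p75_demo and its signal names are hypothetical, and it assumes an 8-bit two's complement sample as in fir_srg.

```verilog
// Hypothetical demo (not from the book): 3.75*t formed as
// 2*t + t + t/2 + t/4 with shifts, sign extension, and adds only.
module mul3p75_demo;
  reg  [7:0] t;                       // 8-bit two's complement sample
  wire [7:0] t_half, t_quarter, prod;

  assign t_half    = {t[7], t[7:1]};        // arithmetic shift right by 1
  assign t_quarter = {t[7], t[7], t[7:2]};  // arithmetic shift right by 2
  assign prod      = (t << 1) + t + t_half + t_quarter;

  initial begin
    t = 8'h10;  // +16
    #1 $display("t=%h prod=%h", t, prod);
    t = 8'hF0;  // -16 in two's complement
    #1 $display("t=%h prod=%h", t, prod);
    $finish;
  end
endmodule
```

The result is printed in hexadecimal because Verilog-1995 has no signed vector type; the 8-bit result wraps around in the same way as the expression for y in fir_srg.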

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: dafsm.v 
// Author-EMAIL : Uwe.Meyer-Baese@ieee.org 

//*********************************************************
`include "case3.v" // User defined component

module dafsm (clk, x_in0, x_in1, x_in2, y); // ---> Interface

input        clk;
input  [2:0] x_in0, x_in1, x_in2;
output [5:0] y;
reg    [5:0] y;

reg  [2:0] x0, x1, x2;
wire [2:0] table_in, table_out;
reg  [5:0] p; // temporary register

assign table_in[0] = x0[0];
assign table_in[1] = x1[0];
assign table_in[2] = x2[0];

always @(posedge clk) // ---> DA in behavioral style
begin : DA
  parameter s0=0, s1=1;
  reg [0:0] state;
  reg [1:0] count; // Counts the shifts
  case (state)
    s0 : begin // Initialization
      state <= s1;
      count = 0;
      p <= {6{1'b0}};
      x0 <= x_in0;
      x1 <= x_in1;
      x2 <= x_in2;
    end
    s1 : begin // Processing step
      if (count == 3) begin // Is sum of product done?
        y <= p;             // Output of result to y and
        state <= s0;        // start next sum of product
      end
      else begin
        p <= {p[5],p[5:1]} + {1'b0,table_out,2'b00};
        x0[0] <= x0[1];
        x0[1] <= x0[2];
        x1[0] <= x1[1];
        x1[1] <= x1[2];
        x2[0] <= x2[1];
        x2[1] <= x2[2];
        count = count + 1;
        state <= s1;
      end
    end
  endcase
end

case3 LC_Table0
  ( .table_in(table_in), .table_out(table_out));

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: case3.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module case3 (table_in, table_out);

input  [2:0] table_in;  // Three bit
output [2:0] table_out; // Range 0 to 6
reg    [2:0] table_out;

// This is the DA CASE table for
// the 3 coefficients: 2, 3, 1

always @(table_in)
begin
  case (table_in)
    0 : table_out = 0;
    1 : table_out = 2;
    2 : table_out = 3;
    3 : table_out = 5;
    4 : table_out = 1;
    5 : table_out = 3;
    6 : table_out = 4;
    7 : table_out = 6;
    default : ;
  endcase
end

endmodule
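
Each entry of the case3 table is the sum of the coefficients selected by the bits of table_in that are set: bit 0 selects 2, bit 1 selects 3, and bit 2 selects 1. The sketch below is a small self-checking testbench built on that reading. It is not part of the original listing; the names case3_tb, expected, and errors are hypothetical, and it assumes case3.v is compiled together with it.

```verilog
// Hypothetical self-checking testbench (not from the book).
// Assumes the case3 module above is compiled alongside this file.
module case3_tb;
  reg  [2:0] table_in;
  wire [2:0] table_out;
  integer k, expected, errors;

  case3 dut ( .table_in(table_in), .table_out(table_out) );

  initial begin
    errors = 0;
    for (k = 0; k < 8; k = k + 1) begin
      table_in = k;
      #1; // let the combinational table settle
      // Sum the coefficients 2, 3, 1 selected by bits 0, 1, 2
      expected = 2*table_in[0] + 3*table_in[1] + 1*table_in[2];
      if (table_out !== expected) begin
        errors = errors + 1;
        $display("Mismatch: in=%d out=%d expected=%d",
                 table_in, table_out, expected);
      end
    end
    $display("case3 check finished with %d mismatch(es)", errors);
    $finish;
  end
endmodule
```

The same check applies to larger DA tables if the coefficient list in the expression for expected is extended accordingly.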

//*********************************************************
// IEEE STD 1364-1995 Verilog file: case5p.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module case5p (clk, table_in, table_out);

input        clk;
input  [4:0] table_in;
output [4:0] table_out; // range 0 to 25
reg    [4:0] table_out;

reg [3:0] lsbs;
reg [1:0] msbs0;
reg [4:0] table0out00, table0out01;

// These are the distributed arithmetic CASE tables for
// the 5 coefficients: 1, 3, 5, 7, 9

always @(posedge clk) begin
  lsbs[0] = table_in[0];
  lsbs[1] = table_in[1];
  lsbs[2] = table_in[2];
  lsbs[3] = table_in[3];
  msbs0[0] = table_in[4];
  msbs0[1] = msbs0[0];
end

// This is the final DA MPX stage.
always @(posedge clk) begin
  case (msbs0[1])
    0 : table_out <= table0out00;
    1 : table_out <= table0out01;
    default : ;
  endcase
end

// This is the DA CASE table 00 out of 1.
always @(posedge clk) begin
  case (lsbs)
     0 : table0out00 = 0;
     1 : table0out00 = 1;
     2 : table0out00 = 3;
     3 : table0out00 = 4;
     4 : table0out00 = 5;
     5 : table0out00 = 6;
     6 : table0out00 = 8;
     7 : table0out00 = 9;
     8 : table0out00 = 7;
     9 : table0out00 = 8;
    10 : table0out00 = 10;
    11 : table0out00 = 11;
    12 : table0out00 = 12;
    13 : table0out00 = 13;
    14 : table0out00 = 15;
    15 : table0out00 = 16;
    default : ;
  endcase
end

// This is the DA CASE table 01 out of 1.
always @(posedge clk) begin
  case (lsbs)
     0 : table0out01 = 9;
     1 : table0out01 = 10;
     2 : table0out01 = 12;
     3 : table0out01 = 13;
     4 : table0out01 = 14;
     5 : table0out01 = 15;
     6 : table0out01 = 17;
     7 : table0out01 = 18;
     8 : table0out01 = 16;
     9 : table0out01 = 17;
    10 : table0out01 = 19;
    11 : table0out01 = 20;
    12 : table0out01 = 21;
    13 : table0out01 = 22;
    14 : table0out01 = 24;
    15 : table0out01 = 25;
    default : ;
  endcase
end

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: darom.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//`include "220model.v"

module darom (clk, x_in0, x_in1, x_in2, y); // ---> Interface

input        clk;
input  [2:0] x_in0, x_in1, x_in2;
output [5:0] y;
reg    [5:0] y;
reg    [2:0] x0, x1, x2;
wire   [2:0] table_in, table_out;

reg [5:0] p; // Temporary register
wire ena;

assign ena=1;

assign table_in[0] = x0[0];
assign table_in[1] = x1[0];
assign table_in[2] = x2[0];

always @(posedge clk) // ---> DA in behavioral style
begin : DA
  parameter s0=0, s1=1;
  reg [0:0] state;
  reg [1:0] count; // Counts the shifts
  case (state)
    s0 : begin // Initialization
      state <= s1;
      count = 0;
      p <= 0;
      x0 <= x_in0;
      x1 <= x_in1;
      x2 <= x_in2;
    end
    s1 : begin // Processing step
      if (count == 3) begin // Is sum of product done?
        y <= p;             // Output of result to y and
        state <= s0;        // start next sum of product
      end
      else begin
        p <= {p[5],p[5:1]} + {1'b0,table_out,2'b00};
        x0[0] <= x0[1];
        x0[1] <= x0[2];
        x1[0] <= x1[1];
        x1[1] <= x1[2];
        x2[0] <= x2[1];
        x2[1] <= x2[2];
        count = count + 1;
        state <= s1;
      end
    end
    default : ;
  endcase
end

lpm_rom rom_1
  ( .address(table_in), .q(table_out)); // Used ports
// .inclock(clk), .outclock(clk), .memenab(ena)); // Unused
defparam rom_1.lpm_width = 3;
defparam rom_1.lpm_widthad = 3;
defparam rom_1.lpm_outdata = "UNREGISTERED";
defparam rom_1.lpm_address_control = "UNREGISTERED";
defparam rom_1.lpm_file = "darom3.mif";

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: dasign.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
`include "case3s.v" // User defined component

module dasign (clk, x_in0, x_in1, x_in2, y); // ---> Interface

input        clk;
input  [3:0] x_in0, x_in1, x_in2;
output [6:0] y;
reg    [6:0] y;

reg  [3:0] x0, x1, x2;
wire [2:0] table_in;
wire [3:0] table_out;
reg  [6:0] p; // Temporary register

assign table_in[0] = x0[0];
assign table_in[1] = x1[0];
assign table_in[2] = x2[0];

always @(posedge clk) // ---> DA in behavioral style
begin : DA
  parameter s0=0, s1=1;
  integer k;
  reg [0:0] state;
  reg [2:0] count; // Counts the shifts
  case (state)
    s0 : begin // Initialization step
      state <= s1;
      count = 0;
      p <= 0;
      x0 <= x_in0;
      x1 <= x_in1;
      x2 <= x_in2;
    end
    s1 : begin // Processing step
      if (count == 4) begin // Is sum of product done?
        y <= p;             // Output of result to y and
        state <= s0;        // start next sum of product
      end
      else begin      // Subtract for last accumulator step
        if (count == 3) // i.e. p/2 +/- table_out * 8
          p <= {p[6],p[6:1]} - (table_out << 3);
        else          // Accumulation for all other steps
          p <= {p[6],p[6:1]} + (table_out << 3);
        for (k=0; k<=2; k=k+1) begin // Shift bits
          x0[k] <= x0[k+1];
          x1[k] <= x1[k+1];
          x2[k] <= x2[k+1];
        end
        count = count + 1;
        state <= s1;
      end
    end
  endcase
end

case3s LC_Table0
  ( .table_in(table_in), .table_out(table_out));

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: case3s.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module case3s (table_in, table_out);

input  [2:0] table_in;  // Three bit
output [3:0] table_out; // Range -2 to 4 -> 3 + sign bit
reg    [3:0] table_out;

// This is the DA CASE table for
// the 3 coefficients: -2, 3, 1

always @(table_in)
begin
  case (table_in)
    0 : table_out = 0;
    1 : table_out = -2;
    2 : table_out = 3;
    3 : table_out = 1;
    4 : table_out = 1;
    5 : table_out = -1;
    6 : table_out = 4;
    7 : table_out = 2;
    default : ;
  endcase
end

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: dapara.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
`include "case3s.v" // User defined component

module dapara (clk, x_in, y); // ---> Interface

input        clk;
input  [3:0] x_in;
output [6:0] y;
reg    [6:0] y;

reg  [2:0] x0, x1, x2, x3;
wire [3:0] y0, y1, y2, y3;
reg  [4:0] s0, s1;
reg  [3:0] t0, t1, t2, t3;

always @(posedge clk) // ---> DA in behavioral style
begin : DA
  integer k;
  for (k=0; k<=1; k=k+1) begin // Shift all four bits
    x0[k] <= x0[k+1];
    x1[k] <= x1[k+1];
    x2[k] <= x2[k+1];
    x3[k] <= x3[k+1];
  end
  x0[2] <= x_in[0]; // Load x_in in the
  x1[2] <= x_in[1]; // MSBs of register 2
  x2[2] <= x_in[2];
  x3[2] <= x_in[3];

  y <= {{3{y0[3]}},y0} + {{2{y1[3]}},y1,1'b0}
     + {y2[3],y2,2'b00} - (y3 << 3);
  // Sign extensions, pipeline register, and adder tree:
  // t0 <= y0; t1 <= y1; t2 <= y2; t3 <= y3;
  // s0 <= {t0[3],t0} + (t1 << 1);
  // s1 <= {t2[3],t2} - (t3 << 1);
  // y <= {{2{s0[4]}},s0} + (s1 << 2);
end

case3s LC_Table0 ( .table_in(x0), .table_out(y0));
case3s LC_Table1 ( .table_in(x1), .table_out(y1));
case3s LC_Table2 ( .table_in(x2), .table_out(y2));
case3s LC_Table3 ( .table_in(x3), .table_out(y3));

endmodule



//********************************************************* 
// IEEE STD 1364-1995 Verilog file: iir.v 
// Author-EMAIL : Uwe.Meyer-Baese@ieee.org 

//*********************************************************
module iir ( x_in,   // Input
             y_out,  // Result
             clk);

parameter W = 14; // Bit width - 1

input  [W:0] x_in;
output [W:0] y_out;
input        clk;

reg [W:0] x, y;

// initial begin
//   y=0;
//   x=0;
// end

// Use FFs for input and recursive part
always @(posedge clk) begin // Note: there is no signed
  x <= x_in;                // integer in Verilog
  y <= x + {y[W],y[W:1]} + {{2{y[W]}},y[W:2]};
  // i.e. x+y/2+y/4;
end

assign y_out = y; // Connect y to output pins

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: iir_pipe.v 
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org 
//********************************************************* 
module iir_pipe (x_in, y_out, clk); // ---> Interface

parameter W = 14; // Bit width - 1

input        clk;
input  [W:0] x_in;  // Input
output [W:0] y_out; // Result

reg [W:0] x, x3, sx;
reg [W:0] y, y9;

always @(posedge clk) // Infer FFs for input, output and
begin                 // pipeline stages;
  x  <= x_in;         // use non-blocking FF assignments
  x3 <= {x[W],x[W:1]} + {x[W],x[W],x[W:2]};
  // i.e. x/2+x/4 = x*3/4
  sx <= x + x3; // Sum of x element i.e. output FIR part
  y9 <= {y[W],y[W:1]} + {{4{y[W]}},y[W:4]};
  // i.e. y/2+y/16 = y*9/16
  y  <= sx + y9; // Compute output
end

assign y_out = y; // Connect register y to output pins

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: iir_par.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module iir_par (clk, x_in, clk2, y_out); // ---> Interface

parameter W = 14; // bit width - 1

input        clk;
input  [W:0] x_in;
output [W:0] y_out;
output       clk2;

reg [W:0] x_even, x_odd, xd_odd, x_wait;
reg [W:0] y_even, y_odd, y_wait, y;
reg [W:0] x_e, x_o, y_e, y_o;
reg [W:0] sum_x_even, sum_x_odd;
reg clk_div2;

always @(posedge clk) // Clock divider by 2
begin : clk_divider   // for input clk
  clk_div2 <= ! clk_div2;
end

always @(posedge clk)      // Split x into even
begin : Multiplex          // and odd samples;
  parameter even=0, odd=1; // recombine y at clk rate
  reg [0:0] state;
  case (state)
    even : begin
      x_even <= x_in;
      x_odd <= x_wait;
      y <= y_wait;
      state <= odd;
    end
    odd : begin
      x_wait <= x_in;
      y <= y_odd;
      y_wait <= y_even;
      state <= even;
    end
  endcase
end

assign y_out = y;
assign clk2 = clk_div2;

always @(negedge clk_div2)
begin: Arithmetic
  sum_x_even <= x_odd + {x_even[W],x_even[W:1]}
              + {x_even[W],x_even[W],x_even[W:2]};
  // i.e. x_odd + x_even / 2 + x_even / 4
  y_even <= sum_x_even + {y_even[W],y_even[W:1]}
          + {{4{y_even[W]}},y_even[W:4]};
  // i.e. sum_x_even + y_even / 2 + y_even / 16
  xd_odd <= x_odd;
  sum_x_odd <= x_even + {xd_odd[W],xd_odd[W:1]}
             + {xd_odd[W],xd_odd[W],xd_odd[W:2]};
  // i.e. x_even + xd_odd / 2 + xd_odd / 4
  y_odd <= sum_x_odd + {y_odd[W],y_odd[W:1]}
         + {{4{y_odd[W]}},y_odd[W:4]};
  // i.e. sum_x_odd + y_odd / 2 + y_odd / 16
end

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: cic3r32.v 
// Author-EMAIL : Uwe.Meyer-Baese@ieee.org 
//********************************************************* 
module cic3r32 (clk, x_in, clk2, y_out); // ---> Interface

input        clk;
input  [7:0] x_in;
output [9:0] y_out;
output       clk2;
reg          clk2;

parameter hold=0, sample=1;
reg [1:0] state;
reg [4:0] count;
reg [7:0] x;        // Registered input
wire [25:0] sxtx;   // Sign extended input
reg [25:0] i0, i1, i2; // I section 0, 1, and 2
reg [25:0] i2d1, i2d2, i2d3, i2d4, c1, c0; // I + COMB 0
reg [25:0] c1d1, c1d2, c1d3, c1d4, c2;     // COMB section 1
reg [25:0] c2d1, c2d2, c2d3, c2d4, c3;     // COMB section 2

always @(posedge clk)
begin : FSM
  if (count == 31) begin
    count <= 0;
    state <= sample;
    clk2 <= 1;
  end
  else begin
    count <= count + 1;
    state <= hold;
    clk2 <= 0;
  end
end

assign sxtx = {{18{x[7]}},x};

always @(posedge clk)
begin : Int
  x  <= x_in;
  i0 <= i0 + sxtx;
  i1 <= i1 + i0;
  i2 <= i2 + i1;
end

always @(posedge clk)
begin : Comb
  if (state == sample) begin
    c0   <= i2;
    i2d1 <= c0;
    i2d2 <= i2d1;
    c1   <= c0 - i2d2;
    c1d1 <= c1;
    c1d2 <= c1d1;
    c2   <= c1 - c1d2;
    c2d1 <= c2;
    c2d2 <= c2d1;
    c3   <= c2 - c2d2;
  end
end

assign y_out = c3[25:16];

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: cic3s32.v 
// Author-EMAIL : Uwe.Meyer-Baese@ieee.org 

//*********************************************************
module cic3s32 (clk, x_in, clk2, y_out); // ---> Interface

input        clk;
input  [7:0] x_in;
output [9:0] y_out;
output       clk2;
reg          clk2;

parameter hold=0, sample=1;
reg [1:0] state;
reg [4:0] count;
reg [7:0] x;       // Registered input
wire [25:0] sxtx;  // Sign extended input
reg [25:0] i0;     // I section 0
reg [20:0] i1;     // I section 1
reg [15:0] i2;     // I section 2
reg [13:0] i2d1, i2d2, i2d3, i2d4, c1, c0; // I + COMB 0
reg [12:0] c1d1, c1d2, c1d3, c1d4, c2;     // COMB section 1
reg [11:0] c2d1, c2d2, c2d3, c2d4, c3;     // COMB section 2

always @(posedge clk)
begin : FSM
  if (count == 31) begin
    count <= 0;
    state <= sample;
    clk2 <= 1;
  end
  else begin
    count <= count + 1;
    state <= hold;
    clk2 <= 0;
  end
end

assign sxtx = {{18{x[7]}},x};

always @(posedge clk)
begin : Int
  x  <= x_in;
  i0 <= i0 + sxtx;
  i1 <= i1 + i0[25:5];
  i2 <= i2 + i1[20:5];
end

always @(posedge clk)
begin : Comb
  if (state == sample) begin
    c0   <= i2[15:2];
    i2d1 <= c0;
    i2d2 <= i2d1;
    c1   <= c0 - i2d2;
    c1d1 <= c1[13:1];
    c1d2 <= c1d1;
    c2   <= c1[13:1] - c1d2;
    c2d1 <= c2[12:1];
    c2d2 <= c2d1;
    c3   <= c2[12:1] - c2d2;
  end
end

assign y_out = c3[11:2];

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: db4poly.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
module db4poly (clk, x_in, clk2, x_e, x_o, g0, g1, y_out);

input         clk;
output        clk2;
input  [7:0]  x_in;
output [16:0] x_e, x_o, g0, g1; // Test signals
output [8:0]  y_out;

reg  [7:0]  x_odd, x_even, x_wait;
wire [16:0] x_odd_sxt, x_even_sxt;
reg clk_div2;

// Register for multiplier, coefficients, and taps
reg [16:0] m0, m1, m2, m3, r0, r1, r2, r3;
reg [16:0] x33, x99, x107;
reg [16:0] y;

always @(posedge clk) // Split into even and odd
begin : Multiplex     // samples at clk rate
  parameter even=0, odd=1;
  reg [0:0] state;
  case (state)
    even : begin
      x_even <= x_in;
      x_odd <= x_wait;
      clk_div2 = 1;
      state <= odd;
    end
    odd : begin
      x_wait <= x_in;
      clk_div2 = 0;
      state <= even;
    end
  endcase
end

assign x_odd_sxt = {{9{x_odd[7]}},x_odd};
assign x_even_sxt = {{9{x_even[7]}},x_even};

always @(x_odd_sxt or x_even_sxt)
begin : RAG
  // Compute auxiliary multiplications of the filter
  x33 = (x_odd_sxt << 5) + x_odd_sxt;
  x99 = (x33 << 1) + x33;
  x107 = x99 + (x_odd_sxt << 3);
  // Compute all coefficients for the transposed filter
  m0 = (x_even_sxt << 7) - (x_even_sxt << 2); // m0 = 124
  m1 = x107 << 1;                             // m1 = 214
  m2 = (x_even_sxt << 6) - (x_even_sxt << 3)
       + x_even_sxt;                          // m2 = 57
  m3 = x33;                                   // m3 = -33
end

always @(negedge clk_div2) // Infer registers;
begin : AddPolyphase       // use non-blocking assignments
  // Compute filter G0
  r0 <= r2 + m0;  // g0 = 124
  r2 <= m2;       // g2 = 57
  // Compute filter G1
  r1 <= -r3 + m1; // g1 = 214
  r3 <= m3;       // g3 = -33
  // Add the polyphase components
  y <= r0 + r1;
end

// Provide some test signal as outputs
assign x_e = x_even;
assign x_o = x_odd;
assign clk2 = clk_div2;
assign g0 = r0;
assign g1 = r1;

assign y_out = y[16:8]; // Connect y / 256 to output

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: db4latti.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
module db4latti (clk, x_in, clk2, x_e, x_o, g, h);

input         clk;
output        clk2;
input  [7:0]  x_in;
output [16:0] x_e, x_o;
output [8:0]  g, h;
reg    [8:0]  g, h;

reg  [7:0]  x_wait;
wire [16:0] x_wait_sxt, x_in_sxt;
reg  [16:0] sx_up, sx_low;
wire [24:0] sx_up_sxt, sx_low_sxt;
reg clk_div2;
wire [16:0] sxa0_up, sxa0_low;
wire [16:0] up0, up1, low1;
reg  [16:0] low0;
wire [24:0] up0_sxt, low0_sxt;

assign x_in_sxt = {{9{x_in[7]}},x_in};
assign x_wait_sxt = {{9{x_wait[7]}},x_wait};

always @(posedge clk) // Split into even and odd
begin : Multiplex     // samples at clk rate
  parameter even=0, odd=1;
  reg [0:0] state;
  case (state)
    even : begin
      // Multiply with 256*s=124
      sx_up  <= (x_in_sxt << 7) - (x_in_sxt << 2);
      sx_low <= (x_wait_sxt << 7) - (x_wait_sxt << 2);
      clk_div2 <= 1;
      state <= odd;
    end
    odd : begin
      x_wait <= x_in;
      clk_div2 <= 0;
      state <= even;
    end
  endcase
end

//******** Multiply a[0] = 1.7321
assign sx_up_sxt = {{8{sx_up[16]}},sx_up};
assign sx_low_sxt = {{8{sx_low[16]}},sx_low};
// i.e. sign extensions
// Compute: (2*sx_up - sx_up/4)-(sx_up/64 + sx_up/256)
assign sxa0_up = ((sx_up_sxt << 1) - (sx_up_sxt >> 2))
               - ((sx_up_sxt >> 6) + (sx_up_sxt >> 8));
// Compute: (2*sx_low - sx_low/4)-(sx_low/64 + sx_low/256)
assign sxa0_low = ((sx_low_sxt << 1) - (sx_low_sxt >> 2))
                - ((sx_low_sxt >> 6) + (sx_low_sxt >> 8));

//******** First stage -- FF in lower tree
assign up0 = sxa0_low + sx_up;
always @(negedge clk_div2)
begin: LowerTreeFF
  low0 <= sx_low - sxa0_up;
end

//******** Second stage: a[1]=0.2679
// Compute: (up0 - low0/4) - (low0/64 + low0/256);
assign up0_sxt = {{8{up0[16]}},up0};
assign low0_sxt = {{8{low0[16]}},low0};
assign up1 = (up0_sxt - (low0_sxt >> 2))
           - ((low0_sxt >> 6) + (low0_sxt >> 8));
// Compute: (low0 + up0/4) + (up0/64 + up0/256)
assign low1 = (low0_sxt + (up0_sxt >> 2))
            + ((up0_sxt >> 6) + (up0_sxt >> 8));

assign x_e = sx_up;     // Provide some extra
assign x_o = sx_low;    // test signals
assign clk2 = clk_div2;

always @(negedge clk_div2)
begin: OutputScale
  g <= up1[16:8];  // i.e. up1 / 256
  h <= low1[16:8]; // i.e. low1 / 256;
end

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: rader7.v 
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org 
//********************************************************* 
module rader7 (clk, x_in, y_real, y_imag); // ---> Interface

input         clk;
input  [7:0]  x_in;
output [10:0] y_real, y_imag;
reg    [10:0] y_real, y_imag;

reg [10:0] accu; // Signal for X[0]

// Note: No direct bit access of 2D vector in Verilog
// use auxiliary signal for this purpose
reg [18:0] imag0, imag1, imag2, imag3, imag4, imag5,
           real0, real1, real2, real3, real4, real5;
// Tapped delay line array
reg [18:0] x57, x111, x160, x200, x231, x250;
// The filter coefficients
reg [18:0] x5, x25, x110, x125, x256;
// Auxiliary filter coefficients
reg [7:0] x, x_0; // Signals for x[0]
wire [18:0] x_sxt, x_0_sxt;

assign x_sxt = {{9{x[7]}},x};       // Sign extension of input
assign x_0_sxt = {{9{x_0[7]}},x_0}; // and x[0]

always @(posedge clk) // State machine for RADER filter
begin : States
  parameter Start=0, Load=1, Run=2;
  reg [1:0] state;
  reg [4:0] count;
  case (state)
    Start : begin // Initialization step
      state <= Load;
      count <= 1;
      x_0 <= x_in; // Save x[0]
      accu <= 0;   // Reset accumulator for X[0]
      y_real <= 0;
      y_imag <= 0;
    end
    Load : begin // Apply x[5],x[4],x[6],x[2],x[3],x[1]
      if (count == 8) // Load phase done ?
        state <= Run;
      else begin
        state <= Load;
        accu <= accu + x_sxt;
      end
      count <= count + 1;
    end
    Run : begin // Apply again x[5],x[4],x[6],x[2],x[3]
      if (count == 15) begin // Run phase done ?
        y_real <= accu;  // X[0]
        y_imag <= 0;     // Only re inputs i.e. Im(X[0])=0
        state <= Start;  // Output of result
      end                // and start again
      else begin
        y_real <= (real0 >> 8) + x_0_sxt;
        // i.e. real[0]/256+x[0]
        y_imag <= (imag0 >> 8); // i.e. imag[0]/256
        state <= Run;
      end
      count <= count + 1;
    end
  endcase
end

always @(posedge clk) // Structure of the two FIR
begin : Structure     // filters in transposed form
  x <= x_in;
  // Real part of FIR filter in transposed form
  real0 <= real1 + x160; // W^1
  real1 <= real2 - x231; // W^3
  real2 <= real3 - x57;  // W^2
  real3 <= real4 + x160; // W^6
  real4 <= real5 - x231; // W^4
  real5 <= -x57;         // W^5

  // Imaginary part of FIR filter in transposed form
  imag0 <= imag1 - x200; // W^1
  imag1 <= imag2 - x111; // W^3
  imag2 <= imag3 - x250; // W^2
  imag3 <= imag4 + x200; // W^6
  imag4 <= imag5 + x111; // W^4
  imag5 <= x250;         // W^5
end

always @(posedge clk) // Note that all signals
begin : Coeffs        // are globally defined
  // Compute the filter coefficients and use FFs
  x160 <= x5 << 5;        // i.e. 160 = 5 * 32;
  x200 <= x25 << 3;       // i.e. 200 = 25 * 8;
  x250 <= x125 << 1;      // i.e. 250 = 125 * 2;
  x57  <= x25 + (x << 5); // i.e. 57 = 25 + 32;
  x111 <= x110 + x;       // i.e. 111 = 110 + 1;
  x231 <= x256 - x25;     // i.e. 231 = 256 - 25;
end

always @(x_sxt or x5 or x25) // Note that all signals
begin : Factors              // are globally defined
  // Compute the auxiliary factor for RAG without an FF
  x5   = (x_sxt << 2) + x_sxt;    // i.e. 5 = 4 + 1;
  x25  = (x5 << 2) + x5;          // i.e. 25 = 5*4 + 5;
  x110 = (x25 << 2) + (x5 << 1);  // i.e. 110 = 25*4 + 5*2;
  x125 = (x25 << 2) + x25;        // i.e. 125 = 25*4 + 25;
  x256 = x_sxt << 8;              // i.e. 256 = 2 ** 8;
end

endmodule

// IEEE STD 1364-1995 Verilog file: ccmul.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
//`include "220model.v"

module ccmul (clk, x_in, y_in, c_in,
              cps_in, cms_in, r_out, i_out);

parameter W2 = 17, // Multiplier bit width
          W1 = 9,  // Bit width c+s sum
          W  = 8;  // Input bit width

input           clk;               // Clock for the output register
input  [W-1:0]  x_in, y_in, c_in;  // Inputs
input  [W1-1:0] cps_in, cms_in;    // Inputs
output [W-1:0]  r_out, i_out;      // Results
reg    [W-1:0]  r_out, i_out;      // Results

wire [W-1:0]  x, y, c;                      // Inputs and outputs
wire [W2-1:0] r, i, cmsy, cpsx, xmyc, sum;  // Products
wire [W1-1:0] xmy, cps, cms, sxtx, sxty;    // x-y etc.

wire clken, cr1, ovl1, cin1, aclr, ADD, SUB;
// Auxiliary signals
assign cin1=0; assign aclr=0; assign ADD=1; assign SUB=0;
assign cr1=0; assign sum=0; assign clken=0;
// Default for add

assign x = x_in;     // x
assign y = y_in;     // j * y
assign c = c_in;     // cos
assign cps = cps_in; // cos + sin
assign cms = cms_in; // cos - sin

always @(posedge clk) begin
  r_out <= r[W2-2:W]; // Scaling and FF for output
  i_out <= i[W2-2:W];
end

//********* ccmul with 3 mul. and 3 add/sub **************
assign sxtx = {x[W-1],x}; // Possible growth for
assign sxty = {y[W-1],y}; // sub_1 -> sign extension

lpm_add_sub sub_1 // Sub: x-y
  ( .result(xmy), .dataa(sxtx), .datab(sxty)); // Used ports
// .add_sub(SUB), .cout(cr1), .overflow(ovl1), .cin(cin1),
// .clken(clken), .clock(clk), .aclr(aclr)); // Unused
defparam sub_1.lpm_width = W1;
defparam sub_1.lpm_representation = "SIGNED";
defparam sub_1.lpm_direction = "sub";

lpm_mult mul_1 // Multiply (x-y)*c = xmyc
  ( .dataa(xmy), .datab(c), .result(xmyc)); // Used ports
// .sum(sum), .clock(clk), .clken(clken), .aclr(aclr));
// Unused ports
defparam mul_1.lpm_widtha = W1;
defparam mul_1.lpm_widthb = W;
defparam mul_1.lpm_widthp = W2;
defparam mul_1.lpm_widths = W2;
defparam mul_1.lpm_representation = "SIGNED";

lpm_mult mul_2 // Multiply (c-s)*y = cmsy
  ( .dataa(cms), .datab(y), .result(cmsy)); // Used ports
// .sum(sum), .clock(clk), .clken(clken), .aclr(aclr));
// Unused ports
defparam mul_2.lpm_widtha = W1;
defparam mul_2.lpm_widthb = W;
defparam mul_2.lpm_widthp = W2;
defparam mul_2.lpm_widths = W2;
defparam mul_2.lpm_representation = "SIGNED";

lpm_mult mul_3 // Multiply (c+s)*x = cpsx
  ( .dataa(cps), .datab(x), .result(cpsx)); // Used ports
// .sum(sum), .clock(clk), .clken(clken), .aclr(aclr));
// Unused ports
defparam mul_3.lpm_widtha = W1;
defparam mul_3.lpm_widthb = W;
defparam mul_3.lpm_widthp = W2;
defparam mul_3.lpm_widths = W2;
defparam mul_3.lpm_representation = "SIGNED";

lpm_add_sub add_1 // Add: r <= (x-y)*c + (c-s)*y
  ( .dataa(cmsy), .datab(xmyc), .result(r)); // Used ports
// .add_sub(ADD), .cout(cr1), .overflow(ovl1), .cin(cin1),
// .clken(clken), .clock(clk), .aclr(aclr)); // Unused
defparam add_1.lpm_width = W2;
defparam add_1.lpm_representation = "SIGNED";
defparam add_1.lpm_direction = "add";

lpm_add_sub sub_2 // Sub: i <= (c+s)*x - (x-y)*c
  ( .dataa(cpsx), .datab(xmyc), .result(i)); // Used ports
// .add_sub(SUB), .cout(cr1), .overflow(ovl1), .clock(clk),
// .cin(cin1), .clken(clken), .aclr(aclr)); // Unused
defparam sub_2.lpm_width = W2;
defparam sub_2.lpm_representation = "SIGNED";
defparam sub_2.lpm_direction = "sub";

endmodule

//*********************************************************
// IEEE STD 1364-1995 Verilog file: bfproc.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//`include "220model.v"
//`include "ccmul.v"

module bfproc (clk, Are_in, Aim_in, Bre_in, Bim_in, c_in,
       cps_in, cms_in, Dre_out, Dim_out, Ere_out, Eim_out);

parameter W2 = 17, // Multiplier bit width
          W1 = 9,  // Bit width c+s sum
          W  = 8;  // Input bit width

input           clk;                   // Clock for the output register
input  [W-1:0]  Are_in, Aim_in;        // 8-bit inputs
input  [W-1:0]  Bre_in, Bim_in, c_in;  // 8-bit inputs
input  [W1-1:0] cps_in, cms_in;        // 9-bit coefficients
output [W-1:0]  Dre_out, Dim_out, Ere_out, Eim_out;
reg    [W-1:0]  Dre_out, Dim_out;      // 8-bit registered
                                       // results
reg [W-1:0]  dif_re, dif_im;       // Bf out
reg [W-1:0]  Are, Aim, Bre, Bim;   // Inputs as integers
reg [W-1:0]  c;                    // Input
reg [W1-1:0] cps, cms;             // Coefficient in

always @(posedge clk) // Compute the additions of the
begin                 // butterfly using integers
  Are <= Are_in;      // and store inputs
  Aim <= Aim_in;      // in flip-flops
  Bre <= Bre_in;
  Bim <= Bim_in;
  c   <= c_in;        // Load from memory cos
  cps <= cps_in;      // Load from memory cos+sin
  cms <= cms_in;      // Load from memory cos-sin
  Dre_out <= ({Are[W-1],Are} + {Bre[W-1],Bre}) >> 1;
  // i.e. Are/2 + Bre/2
  Dim_out <= ({Aim[W-1],Aim} + {Bim[W-1],Bim}) >> 1;
end // i.e. Aim/2 + Bim/2

// No FF because butterfly difference "diff" is not an
always @(Are or Bre or Aim or Bim) // output port
begin
  dif_re = ({Are[W-1],Are} - {Bre[W-1],Bre}) >> 1;
  // i.e. Are/2 - Bre/2
  dif_im = ({Aim[W-1],Aim} - {Bim[W-1],Bim}) >> 1;
end // i.e. Aim/2 - Bim/2

//*** Instantiate the complex twiddle factor multiplier
ccmul ccmul_1 // Multiply (x+jy)(c+js)
  ( .clk(clk), .x_in(dif_re), .y_in(dif_im), .c_in(c),
    .cps_in(cps), .cms_in(cms), .r_out(Ere_out),
    .i_out(Eim_out));

endmodule



// IEEE STD 1364-1995 Verilog file: lfsr.v 
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org 
//********************************************************* 
module lfsr (clk, y); // ---> Interface

input        clk;
output [6:1] y; // Result

reg [6:1] ff; // Note that reg is keyword in Verilog and
              // can not be variable name
integer i;

always @(posedge clk) begin // Length 6 LFSR with xnor
  ff[1] <= ff[5] ~^ ff[6];  // Use non-blocking assignment
  for (i=6; i>=2; i=i-1)    // Tapped delay line: shift one
    ff[i] <= ff[i-1];
end

assign y = ff; // Connect to I/O cell

endmodule



// IEEE STD 1364-1995 Verilog file: lfsr6s3.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module lfsr6s3 (clk, y); // ----> Interface

  input        clk;
  output [6:1] y;        // Result

  reg    [6:1] ff;       // Note that reg is keyword in Verilog and
                         // can not be variable name

  always @(posedge clk) begin  // Implement 3 step length
    ff[6] <= ff[3];            // 6 LFSR with xnor; use
    ff[5] <= ff[2];            // non-blocking assignments
    ff[4] <= ff[1];
    ff[3] <= ff[5] ~^ ff[6];
    ff[2] <= ff[4] ~^ ff[5];
    ff[1] <= ff[3] ~^ ff[4];
  end

  assign y = ff;
endmodule

// IEEE STD 1364-1995 Verilog file: ammod.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
module ammod (clk, r_in, phi_in,
              x_out, y_out, eps);  // ----> Interface

  parameter W = 8;         // Bit width - 1
  input        clk;
  input  [W:0] r_in, phi_in;
  output [W:0] x_out, y_out, eps;
  reg    [W:0] x_out, y_out, eps;

  reg [W:0] r, phi;
  reg [W:0] x0, y0, z0;    // There is no bit access in 2D
  reg [W:0] x1, y1, z1;    // array types in Verilog,
  reg [W:0] x2, y2, z2;    // therefore use single vectors
  reg [W:0] x3, y3, z3;

  always @(posedge clk) begin  // ----> Infer register
    if (phi_in > 90)           // Test for |phi_in| > 90
      begin                    // Rotate 90 degrees
        x0 <= 0;
        y0 <= r_in;            // Input in register 0
        z0 <= phi_in - 'd90;
      end
    else if ((phi_in > 331) && (phi_in < 423))
      begin
        x0 <= 0;
        y0 <= - r_in;
        z0 <= phi_in + 'd90;
      end
    else
      begin
        x0 <= r_in;
        y0 <= 0;
        z0 <= phi_in;
      end

    if (z0 > 0)                // Rotate 45 degrees
      begin
        x1 <= x0 - y0;
        y1 <= y0 + x0;
        z1 <= z0 - 'd45;
      end
    else
      begin
        x1 <= x0 + y0;
        y1 <= y0 - x0;
        z1 <= z0 + 'd45;
      end

    if (z1 > 0)                // Rotate 26 degrees
      begin
        x2 <= x1 - {y1[W], y1[W:1]};  // i.e. x1 - y1/2
        y2 <= y1 + {x1[W], x1[W:1]};  // i.e. y1 + x1/2
        z2 <= z1 - 'd26;
      end
    else
      begin
        x2 <= x1 + {y1[W], y1[W:1]};  // i.e. x1 + y1/2
        y2 <= y1 - {x1[W], x1[W:1]};  // i.e. y1 - x1/2
        z2 <= z1 + 'd26;
      end

    if (z2 > 0)                // Rotate 14 degrees
      begin
        x3 <= x2 - {y2[W], y2[W], y2[W:2]};  // i.e. x2 - y2/4
        y3 <= y2 + {x2[W], x2[W], x2[W:2]};  // i.e. y2 + x2/4
        z3 <= z2 - 'd14;
      end
    else
      begin
        x3 <= x2 + {y2[W], y2[W], y2[W:2]};  // i.e. x2 + y2/4
        y3 <= y2 - {x2[W], x2[W], x2[W:2]};  // i.e. y2 - x2/4
        z3 <= z2 + 'd14;
      end

    x_out <= x3;
    eps   <= z3;
    y_out <= y3;
  end

endmodule

// IEEE STD 1364-1995 Verilog file: fir_lms.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org

// This is a generic FIR filter generator
// It uses W1 bit data/coefficients bits
module fir_lms
  (clk, x_in, d_in, e_out, y_out, f0_out, f1_out);

  parameter W1 = 8,      // Input bit width
            W2 = 16,     // Multiplier bit width 2*W1
            L  = 2,      // Filter length
            Delay = 3;   // Pipeline steps of multiplier
  input           clk;             // 1 bit input
  input  [W1-1:0] x_in, d_in;      // Inputs
  output [W2-1:0] e_out, y_out;    // Results
  output [W1-1:0] f0_out, f1_out;  // Results

// 2D array types i.e. memories not supported by MaxPlusII
// in Verilog, use therefore single vectors
  reg  [W1-1:0] x, x0, x1, f0, f1;     // Coefficient array
  reg  [W1-1:0] d;
  wire [W1-1:0] emu;
  wire [W2-1:0] p0, p1, xemu0, xemu1;  // Product array
  wire [W2-1:0] y, sxty, e, sxtd;
  wire          clken, aclr;
  wire [W2-1:0] sum;                   // Auxiliary signals

  assign sum=0; assign aclr=0;         // Default for mult
  assign clken=0;

// 16 bit signed extension for input d
  assign sxtd = {{8{d[7]}}, d};

  always @(posedge clk)  // Store these data or coefficients
    begin: Store
      d  <= d_in;   // Store desired signal in register
      x0 <= x_in;   // Get one data sample at a time
      x1 <= x0;     // shift 1
      f0 <= f0 + xemu0[15:8];  // implicit divide by 2
      f1 <= f1 + xemu1[15:8];
    end

// Instantiate L pipelined multiplier
// Multiply p(i) = f(i) * x(i);
  lpm_mult mul_0                   // Multiply x0*f0 = p0
    ( .dataa(x0), .datab(f0), .result(p0));
//    .clock(clk), .sum(sum),
//    .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_0.lpm_widtha = W1;
    defparam mul_0.lpm_widthb = W1;
    defparam mul_0.lpm_widthp = W2;
    defparam mul_0.lpm_widths = W2;
//  defparam mul_0.lpm_pipeline = Delay;
    defparam mul_0.lpm_representation = "SIGNED";

  lpm_mult mul_1                   // Multiply x1*f1 = p1
    ( .dataa(x1), .datab(f1), .result(p1));
//    .clock(clk), .sum(sum),
//    .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_1.lpm_widtha = W1;
    defparam mul_1.lpm_widthb = W1;
    defparam mul_1.lpm_widthp = W2;
    defparam mul_1.lpm_widths = W2;
//  defparam mul_1.lpm_pipeline = Delay;
    defparam mul_1.lpm_representation = "SIGNED";

  assign y = p0 + p1;              // Compute ADF output

// Scale y by 128 because x is fraction
  assign sxty = { {7{y[15]}}, y[15:7]};

  assign e = sxtd - sxty;
  assign emu = e[8:1];             // e*mu divide by 2 and
                                   // 2 from xemu makes mu=1/4
// Instantiate L pipelined multiplier
// Multiply xemu(i) = emu * x(i);
  lpm_mult mul_3                   // Multiply xemu0 = emu * x0;
    ( .dataa(x0), .datab(emu), .result(xemu0));
//    .clock(clk), .sum(sum),
//    .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_3.lpm_widtha = W1;
    defparam mul_3.lpm_widthb = W1;
    defparam mul_3.lpm_widthp = W2;
    defparam mul_3.lpm_widths = W2;
//  defparam mul_3.lpm_pipeline = Delay;
    defparam mul_3.lpm_representation = "SIGNED";

  lpm_mult mul_4                   // Multiply xemu1 = emu * x1;
    ( .dataa(x1), .datab(emu), .result(xemu1));
//    .clock(clk), .sum(sum),
//    .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_4.lpm_widtha = W1;
    defparam mul_4.lpm_widthb = W1;
    defparam mul_4.lpm_widthp = W2;
    defparam mul_4.lpm_widths = W2;
//  defparam mul_4.lpm_pipeline = Delay;
    defparam mul_4.lpm_representation = "SIGNED";

  assign y_out  = y;               // Monitor some test signals
  assign e_out  = e;
  assign f0_out = f0;
  assign f1_out = f1;

endmodule

//********************************************************* 
// IEEE STD 1364-1995 Verilog file: fir6dlms.v
// Author-EMAIL: Uwe.Meyer-Baese@ieee.org
//*********************************************************
// This is a generic DFIR filter generator
// It uses W1 bit data/coefficients bits
module fir6dlms
  (clk, x_in, d_in, e_out, y_out, f0_out, f1_out);

  parameter W1 = 8,      // Input bit width
            W2 = 16,     // Multiplier bit width 2*W1
            L  = 2,      // Filter length
            Delay = 3;   // Pipeline steps of multiplier
  input           clk;             // 1 bit input
  input  [W1-1:0] x_in, d_in;      // Inputs
  output [W2-1:0] e_out, y_out;    // Results
  output [W1-1:0] f0_out, f1_out;  // Results

// 2D array types i.e. memories not supported by MaxPlusII
// in Verilog, use therefore single vectors
  reg  [W1-1:0] x, x0, x1, x2, x3, x4, f0, f1;
  reg  [W1-1:0] d0, d1, d2, d3;        // Desired signal array
  wire [W1-1:0] emu;
  wire [W2-1:0] p0, p1, xemu0, xemu1;  // Product array
  wire [W2-1:0] y, sxty, e, sxtd;
  wire          clken, aclr;
  wire [W2-1:0] sum;                   // Auxiliary signals

  assign sum=0; assign aclr=0;         // Default for mult
  assign clken=0;

// 16 bit signed extension for input d
  assign sxtd = {{8{d3[7]}}, d3};

  always @(posedge clk)  // Store these data or coefficients
    begin: Store
      d0 <= d_in;   // Shift register for desired data
      d1 <= d0;
      d2 <= d1;
      d3 <= d2;
      x0 <= x_in;
      x1 <= x0;
      x2 <= x1;
      x3 <= x2;
      x4 <= x3;
      f0 <= f0 + xemu0[15:8];  // implicit divide by 2
      f1 <= f1 + xemu1[15:8];
    end

// Instantiate L pipelined multiplier
// Multiply p(i) = f(i) * x(i);
  lpm_mult mul_0                   // Multiply x0*f0 = p0
    ( .clock(clk), .dataa(x0), .datab(f0), .result(p0));
//    .sum(sum), .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_0.lpm_widtha = W1;
    defparam mul_0.lpm_widthb = W1;
    defparam mul_0.lpm_widthp = W2;
    defparam mul_0.lpm_widths = W2;
    defparam mul_0.lpm_pipeline = Delay;
    defparam mul_0.lpm_representation = "SIGNED";

  lpm_mult mul_1                   // Multiply x1*f1 = p1
    ( .clock(clk), .dataa(x1), .datab(f1), .result(p1));
//    .sum(sum), .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_1.lpm_widtha = W1;
    defparam mul_1.lpm_widthb = W1;
    defparam mul_1.lpm_widthp = W2;
    defparam mul_1.lpm_widths = W2;
    defparam mul_1.lpm_pipeline = Delay;
    defparam mul_1.lpm_representation = "SIGNED";

  assign y = p0 + p1;              // Compute ADF output

// Scale y by 128 because x is fraction
  assign sxty = { {7{y[15]}}, y[15:7]};

  assign e = sxtd - sxty;
  assign emu = e[8:1];             // e*mu divide by 2 and
                                   // 2 from xemu makes mu=1/4
// Instantiate L pipelined multiplier
// Multiply xemu(i) = emu * x(i);
  lpm_mult mul_3                   // Multiply xemu0 = emu * x0;
    ( .clock(clk), .dataa(x3), .datab(emu), .result(xemu0));
//    .sum(sum), .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_3.lpm_widtha = W1;
    defparam mul_3.lpm_widthb = W1;
    defparam mul_3.lpm_widthp = W2;
    defparam mul_3.lpm_widths = W2;
    defparam mul_3.lpm_pipeline = Delay;
    defparam mul_3.lpm_representation = "SIGNED";

  lpm_mult mul_4                   // Multiply xemu1 = emu * x1;
    ( .clock(clk), .dataa(x4), .datab(emu), .result(xemu1));
//    .sum(sum), .clken(clken), .aclr(aclr));  // Unused ports
    defparam mul_4.lpm_widtha = W1;
    defparam mul_4.lpm_widthb = W1;
    defparam mul_4.lpm_widthp = W2;
    defparam mul_4.lpm_widths = W2;
    defparam mul_4.lpm_pipeline = Delay;
    defparam mul_4.lpm_representation = "SIGNED";

  assign y_out  = y;               // Monitor some test signals
  assign e_out  = e;
  assign f0_out = f0;
  assign f1_out = f1;

endmodule




B. VHDL and Verilog Coding 



Unfortunately, two HDLs are popular today. The US west coast and Asia
prefer Verilog, while the US east coast and Europe more frequently use
VHDL. For digital signal processing with FPGAs, both languages seem to be
well suited, but some VHDL examples are a little easier to read because
signed arithmetic and multiply/divide operations are supported in the
IEEE VHDL 1076-1987 and 1076-1993 standards. This gap will disappear
when the Verilog IEEE standard 1364-1999 is approved, as it will also include
signed arithmetic. Other constraints may include personal preferences, EDA
library and tool availability, data types, readability, capability, and language
extensions using PLIs, as well as commercial, business, and marketing issues,
to name just a few. A detailed comparison can be found in the book by
Smith [3]. Tool providers acknowledge today that both languages need to be
supported.

It is therefore a good idea to use an HDL code style that can easily be
translated into either language. An important rule is to avoid using any
keyword of either language when naming variables, labels, constants,
user types, etc. The IEEE standard VHDL 1076-1987 uses 77 keywords and
an extra 19 keywords are used in VHDL 1076-1993 (see VHDL 1076-1993
Language Reference Manual (LRM) on p. 179). New in VHDL 1076-1993
are:

GROUP, IMPURE, INERTIAL, LITERAL, POSTPONED, PURE, REJECT,
ROL, ROR, SHARED, SLA, SLL, SRA, SRL, UNAFFECTED, XNOR,

which are unfortunately not highlighted in the MaxPlusII editor. The IEEE
standard Verilog 1364-1995, on the other hand, has 102 keywords (see LRM,
p. 604). Together, the two HDLs have 182 keywords, including 17
in common. Table B.1 shows VHDL 1076-1993 keywords in capital letters,
Verilog 1364-1995 keywords in small letters, and the common keywords with
a capital first letter.

Table B.1. VHDL 1076-1993 and Verilog 1364-1995 keywords.



ABS             event           OF              SLA
ACCESS          EXIT            ON              SLL
AFTER           FILE            OPEN            small
ALIAS           For             Or              specify
ALL             force           OTHERS          specparam
always          forever         OUT             SRA
And             fork            output          SRL
ARCHITECTURE    Function        PACKAGE         strong0
ARRAY           GENERATE        parameter       strong1
ASSERT          GENERIC         pmos            SUBTYPE
assign          GROUP           PORT            supply0
ATTRIBUTE       GUARDED         posedge         supply1
Begin           highz0          POSTPONED       table
BLOCK           highz1          primitive       task
BODY            If              PROCEDURE       THEN
buf             ifnone          PROCESS         time
BUFFER          IMPURE          pull0           TO
bufif0          IN              pull1           tran
bufif1          INERTIAL        pulldown        tranif0
BUS             initial         pullup          tranif1
Case            Inout           PURE            TRANSPORT
casex           input           RANGE           tri
casez           integer         rcmos           tri0
cmos            IS              real            tri1
COMPONENT       join            realtime        triand
CONFIGURATION   LABEL           RECORD          trior
CONSTANT        large           reg             trireg
deassign        LIBRARY         REGISTER        TYPE
default         LINKAGE         REJECT          UNAFFECTED
defparam        LITERAL         release         UNITS
disable         LOOP            REM             UNTIL
DISCONNECT      macromodule     repeat          USE
DOWNTO          MAP             REPORT          VARIABLE
edge            medium          RETURN          vectored
Else            MOD             rnmos           Wait
ELSIF           module          ROL             wand
End             Nand            ROR             weak0
endcase         negedge         rpmos           weak1
endfunction     NEW             rtran           WHEN
endmodule       NEXT            rtranif0        While
endprimitive    nmos            rtranif1        wire
endspecify      Nor             scalared        WITH
endtable        Not             SELECT          wor
endtask         notif0          SEVERITY        Xnor
ENTITY          notif1          SHARED          Xor
                NULL            SIGNAL
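
The naming rule above can be illustrated with a small sketch. The module below is hypothetical (the name dff_w, its ports, and the width are assumptions made for illustration, not one of the book's designs); it only shows identifiers chosen so that none of them collides with a keyword from either column of the table.

```verilog
// Sketch: every identifier avoids the keywords of BOTH languages,
// so the names can be carried over unchanged to a VHDL version.
module dff_w (clk, d_in, q_out);   // hypothetical module name
  parameter W = 8;                 // bit width (assumed value)

  input          clk;
  input  [W-1:0] d_in;    // not "in":  legal in Verilog, but IN  is a VHDL keyword
  output [W-1:0] q_out;   // not "out": legal in Verilog, but OUT is a VHDL keyword
  reg    [W-1:0] ff;      // not "reg" (Verilog) and not "register" (VHDL)

  always @(posedge clk)   // Infer W flip-flops
    ff <= d_in;

  assign q_out = ff;      // Connect to output port
endmodule
```

The same convention is visible in the listings of Appendix A, where ports carry the suffixes _in and _out and the LFSR register is called ff.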
B.1 List of Examples

The following table displays the results for all VHDL and Verilog examples
given in this book.

                    VHDL                       Verilog
Design       LCs     MHz    Page        LCs     MHz    Page
add_1p        26    63.29     55         26    63.29    437
add_2p        58    63.29     54         58    63.29    439
add_3p       105    60.97     54        105    60.97    442
ammod        279    28.49    343        280    35.08    479
bfproc       531    13.56    268        542    14.10    476
ccmul        493      -      265        504      -      474
cic3r32      401    40.0     191        404    35.58    465
cic3s32      238    44.64    199        230    42.37    466
cordic       244    39.68     99        245    37.17    449
dafsm         37    56.17    128         37    55.86    454
dapara        39    31.84    140         39    33.55    461
darom         34    28.01    135         34    28.24    457
dasign        65    33.44    137         65    35.58    459
db4latti     331    45.24    225        321    53.76    470
db4poly      208    78.74    180        191    74.62    468
div_aegp     469    14.59     67        479    14.59    446
div_res      179    37.59     74        142    34.96    448
example       25   125.00     14         25   125.00    435
fir6dlms     658    23.41    399        658    23.41    483
fir_gen      892    41.66    111        882    41.32    451
fir_lms      612     9.00    392        612     9.00    481
fir_srg       97    17.45    123         97    17.45    453
fun_text      32    54.64     23         32    54.64    436
iir           31    42.91    149         31    39.84    462
iir_par      215    31.34    169        217    24.63    464
iir_pipe      64    49.75    163         64    49.75    463
lfsr           6    45.45    325          6    45.45    478
lfsr6s3        6    43.85    328          6    43.85    478
mul_ser      115    41.15     58        139    39.52    445
rader7       486    23.04    253        497    23.86    472


The following options of MaxPlusII version 10.2 were used:

• Global Project Synthesis Style set to FAST.
• Assign -> Global Project Logic Synthesis option set to Optimize 10
  (Speed).
• Assign -> Global Project Logic Synthesis -> Automatic Fast I/O.
• Assign -> Device for Device Family, option FLEX10K.
• Devices -> EPF10K70RC240-4.
• Fitter Settings -> Use Quartus Fitter for FLEX 10K and ACEX 1K
  Devices.

The data in MHz are the Registered Performance (from the “timing
analyzer output” files, i.e., *.tao) for the designs. The table is structured as
follows: the first column shows the “entity” or module name of the design.
Columns 2 to 4 are data for the VHDL designs: the number of LCs shown
in the report file (*.rpt), the Registered Performance, and the page with the
source code. The same data are provided for the Verilog design examples,
shown in columns 5 to 7. Comparing the VHDL and Verilog synthesis
results, we see that some data are not exactly identical. These results can be
easily reproduced by using the scripts maxplus.bat in the VHDL or Verilog
directories of the CD-ROM, and a “grep” through the report files (*.rpt)
and the timing analyzer output files (*.tao).



B.2 Library of Parameterized Modules (LPM) 

Throughout the book we use five different LPM megafunctions (see Fig. B.1),
namely:

• lpm_ff, the flip-flop megafunction

• lpm_add_sub, the adder/subtractor megafunction 

• lpm_rom, the ROM megafunction 

• lpm_divide, the divider megafunction, and 

• lpm_mult, the multiplier megafunction 

These megafunctions are explained in the following, along with their port
definitions, parameters, and resource usage. This information is also available
using the MaxPlusII help under VHDL -> Megafunctions/LPM, or in the “LPM
Quick Reference Guide,” located on the Altera digital library CD-ROM as
PDF document literature/catalogs/lpm.pdf.

B.2.1 The Parameterized Flip-flop Megafunction (lpm_ff)

The lpm_ff function is useful if features are needed that are not available
in the DFF, DFFE, TFF, and TFFE primitives, such as synchronous or
asynchronous set, clear, and load inputs. We have used this megafunction for the
following designs: example, p. 14, fun_text, p. 23, add_1p, p. 55, add_2p,
p. 54, add_3p, p. 54, as well as for the components csa7 and csa7cin.

Altera recommends instantiating this function as described in “Creating 
a Custom Megafunction Variation with the MegaWizard Plug-In Manager.” 
The port names and order for Verilog HDL prototypes are: 

module lpm_ff ( q, data, clock, enable, aclr,
                aset, sclr, sset, aload, sload);

Fig. B.1. The five LPM megafunctions used.

The VHDL component declaration is shown below:

COMPONENT lpm_ff
  GENERIC (LPM_WIDTH:  POSITIVE;
           LPM_AVALUE: STRING := "UNUSED";
           LPM_FFTYPE: STRING := "FFTYPE_DFF";
           LPM_TYPE:   STRING := "L_FF";
           LPM_SVALUE: STRING := "UNUSED";
           LPM_HINT:   STRING := "UNUSED");
  PORT (data:   IN STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        clock:  IN STD_LOGIC;
        enable: IN STD_LOGIC := '1';
        sload:  IN STD_LOGIC := '0';
        sclr:   IN STD_LOGIC := '0';
        sset:   IN STD_LOGIC := '0';
        aload:  IN STD_LOGIC := '0';
        aclr:   IN STD_LOGIC := '0';
        aset:   IN STD_LOGIC := '0';
        q:      OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
END COMPONENT;



Ports 

The following table displays all INPUT ports of lpm_ff:

Port name: data
  Required:    No
  Description: T-type flip-flop: toggle enable. D-type flip-flop: data
               input.
  Comments:    Input port LPM_WIDTH wide. If the data input is not used,
               at least one of the aset, aclr, sset, or sclr ports must
               be used. Unused data inputs default to GND.

Port name: clock
  Required:    Yes
  Description: Positive-edge-triggered clock.

Port name: enable
  Required:    No
  Description: Clock enable input.
  Comments:    Default = 1.

Port name: sclr
  Required:    No
  Description: Synchronous clear input.
  Comments:    If both sset and sclr are used and both are asserted, sclr
               is dominant. The sclr signal affects the output q values
               before polarity is applied to the ports.

Port name: sset
  Required:    No
  Description: Synchronous set input.
  Comments:    Sets q outputs to the value specified by LPM_SVALUE, if
               that value is present, or sets the q outputs to all 1s. If
               both sset and sclr are used and both are asserted, sclr is
               dominant. The sset signal affects the output q values
               before polarity is applied to the ports.

Port name: sload
  Required:    No
  Description: Synchronous load input. Loads the flip-flop with the value
               on the data input on the next active clock edge.
  Comments:    Default = 0. If sload is used, data must be used. For load
               operation, sload must be high (1) and enable must be high
               (1) or unconnected. The sload port is ignored when the
               LPM_FFTYPE parameter is set to “DFF.”

Port name: aclr
  Required:    No
  Description: Asynchronous clear input.
  Comments:    If both aset and aclr are used and both are asserted, aclr
               is dominant. The aclr signal affects the output q values
               before polarity is applied to the ports.

Port name: aset
  Required:    No
  Description: Asynchronous set input.
  Comments:    Sets q outputs to the value specified by LPM_AVALUE, if
               that value is present, or sets the q outputs to all 1s.

Port name: aload
  Required:    No
  Description: Asynchronous load input. Asynchronously loads the flip-flop
               with the value on the data input.
  Comments:    Default = 0. If aload is used, data must be used.







The following table displays all OUTPUT ports of lpm_ff:

Port name: q
  Description: Data output from D or T flip-flops.
  Comments:    Output port LPM_WIDTH wide.



Parameters 

The following table shows the parameters of the lpm_ff component:

Parameter: LPM_WIDTH
  Type:        Integer
  Required:    Yes
  Description: Width of the data and q ports.

Parameter: LPM_AVALUE
  Type:        Integer
  Required:    No
  Description: Constant value that is loaded when aset is high. If
               omitted, defaults to all 1s. The LPM_AVALUE parameter is
               limited to a maximum of 32 bits.

Parameter: LPM_SVALUE
  Type:        Integer
  Required:    No
  Description: Constant value that is loaded on the rising edge of clock
               when sset is high. If omitted, defaults to all 1s.

Parameter: LPM_FFTYPE
  Type:        String
  Required:    No
  Description: Type of flip-flop. Values are “DFF,” “TFF,” and “UNUSED.”
               If omitted, the default is “DFF.” When the LPM_FFTYPE
               parameter is set to “DFF,” the sload port is ignored.

Parameter: LPM_HINT
  Type:        String
  Required:    No
  Description: Allows you to specify Altera-specific parameters in VHDL
               design files. The default is “UNUSED.”

Parameter: LPM_TYPE
  Type:        String
  Required:    No
  Description: Identifies the LPM entity name in VHDL design files.

Note that for Verilog LPM 220 synthesizable code (i.e., 220model.v) the
following parameter ordering applies: lpm_type, lpm_width, lpm_avalue,
lpm_svalue, lpm_pvalue, lpm_fftype, lpm_hint.



Function 



The following table is an example of the T-type flip-flop behavior in lpm_ff:

                  Inputs                        Outputs
aclr aset enable clock sclr sset sload   q[LPM_WIDTH-1..0]
 1    X     X      X    X    X     X     000...
 0    1     X      X    X    X     X     111... or LPM_AVALUE
 0    0     0      X    X    X     X     q[LPM_WIDTH-1..0]
 0    0     1      ↑    1    X     X     000...
 0    0     1      ↑    0    1     X     111... or LPM_SVALUE
 0    0     1      ↑    0    0     1     data[LPM_WIDTH-1..0]
 0    0     1      ↑    0    0     0     q[LPM_WIDTH-1..0]
                                         xor data[LPM_WIDTH-1..0]
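
The function table can be read as a priority chain: asynchronous clear, asynchronous set, clock enable, then the synchronous inputs. The following behavioral sketch models that chain in plain Verilog. The module name tff_model and the parameter WIDTH are hypothetical, the sketch assumes LPM_AVALUE and LPM_SVALUE are omitted (so both set inputs load all 1s), aload is left out, and it is not the Altera lpm_ff source.

```verilog
// Behavioral sketch of the T-type function table (assumed model,
// not the Altera implementation). Port order follows the lpm_ff
// prototype without aload.
module tff_model (q, data, clock, enable, aclr, aset, sclr, sset, sload);
  parameter WIDTH = 8;                 // stands in for LPM_WIDTH

  input  [WIDTH-1:0] data;
  input              clock, enable, aclr, aset, sclr, sset, sload;
  output [WIDTH-1:0] q;
  reg    [WIDTH-1:0] q;

  always @(posedge clock or posedge aclr or posedge aset)
    if (aclr)                          // aclr dominates aset
      q <= {WIDTH{1'b0}};
    else if (aset)                     // all 1s, no LPM_AVALUE assumed
      q <= {WIDTH{1'b1}};
    else if (enable) begin             // enable = 0 holds q
      if (sclr)                        // sclr dominates sset
        q <= {WIDTH{1'b0}};
      else if (sset)                   // all 1s, no LPM_SVALUE assumed
        q <= {WIDTH{1'b1}};
      else if (sload)                  // synchronous load
        q <= data;
      else                             // toggle the bits where data = 1
        q <= q ^ data;
    end
endmodule
```

With data tied to all 1s and no set, clear, or load asserted, every enabled clock edge inverts q, which is the usual toggle flip-flop.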



Resource Usage 

The megafunction lpm_ff uses one logic cell per bit. 



B.2.2 The Parameterized Adder/Subtractor Megafunction
(lpm_add_sub)

Altera recommends using the lpm_add_sub function to replace all other types
of adder/subtractor functions, including old-style adder/subtractor
macrofunctions. We have used this megafunction for the following designs: example,
p. 14, fun_text, p. 23, add_1p, p. 55, ccmul, p. 265, as well as for the
components add_ff8, add_ff8cin, csa7, and csa7cin.

Altera recommends instantiating this function as described in “Creating 
a Custom Megafunction Variation with the MegaWizard Plug-In Manager.” 
The port names and order for Verilog HDL prototypes are: 

module lpm_add_sub ( cin, 

dataa, datab, 
add_sub, clock, aclr, 
result, cout, overflow); 

The VHDL component declaration is shown below: 

COMPONENT lpm_add_sub
  GENERIC (LPM_WIDTH: POSITIVE;
           LPM_REPRESENTATION: STRING := "SIGNED";
           LPM_DIRECTION: STRING := "UNUSED";
           LPM_HINT: STRING := "UNUSED";
           LPM_PIPELINE: INTEGER := 0;
           LPM_TYPE: STRING := "L_ADD_SUB");
  PORT (dataa, datab
               : IN STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        aclr, clken, clock, cin : IN STD_LOGIC := '0';
        add_sub : IN STD_LOGIC := '1';
        result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        cout, overflow : OUT STD_LOGIC);
END COMPONENT;



Ports 

The following table displays all INPUT ports of lpm_add_sub:

Port name: cin
  Required:    No
  Description: Carry-in to the low-order bit. If the operation is “ADD,”
               low = 0 and high = +1. If the operation is “SUB,”
               low = -1 and high = 0.
  Comments:    If omitted, the default is 0 (i.e., low if the operation
               is “ADD” and high if the operation is “SUB”).

Port name: dataa
  Required:    Yes
  Description: Augend/Minuend.
  Comments:    Input port LPM_WIDTH wide.

Port name: datab
  Required:    Yes
  Description: Addend/Subtrahend.
  Comments:    Input port LPM_WIDTH wide.

Port name: add_sub
  Required:    No
  Description: If the signal is high, the operation = dataa + datab. If
               the signal is low, the operation = dataa - datab.
  Comments:    If the LPM_DIRECTION parameter is used, add_sub cannot be
               used. If omitted, the default is “ADD.” Altera recommends
               that you use the LPM_DIRECTION parameter to specify the
               operation of the lpm_add_sub function, rather than
               assigning a constant to the add_sub port.

Port name: clock
  Required:    No
  Description: Clock for pipelined usage.
  Comments:    The clock port provides pipelined operation for the
               lpm_add_sub function. For LPM_PIPELINE values other than 0
               (default value), the clock port must be connected.

Port name: clken
  Required:    No
  Description: Clock enable for pipelined usage.
  Comments:    Available for VHDL only.

Port name: aclr
  Required:    No
  Description: Asynchronous clear for pipelined usage.
  Comments:    The pipeline initializes to an undefined (X) logic level.
               The aclr port can be used at any time to reset the
               pipeline to all 0s, asynchronously to the clock signal.

The following table displays all OUTPUT ports of lpm_add_sub:



Port name: result
  Required:    Yes
  Description: dataa + or - datab + or - cin.
  Comments:    Output port LPM_WIDTH wide.

Port name: cout
  Required:    No
  Description: Carry-out (borrow-in) of the MSB.
  Comments:    If overflow is used, cout cannot be used. The cout port
               has a physical interpretation as the carry-out (borrow-in)
               of the MSB. cout is most meaningful for detecting overflow
               in “UNSIGNED” operations.

Port name: overflow
  Required:    No
  Description: Result exceeds available precision.
  Comments:    If overflow is used, cout cannot be used. The overflow
               port has a physical interpretation as the XOR of the
               carry-in to the MSB with the carry-out of the MSB.
               overflow is meaningful only when the LPM_REPRESENTATION
               parameter value is “SIGNED.”



Parameters 

The following table shows the parameters of the lpm_add_sub component:

Parameter: LPM_WIDTH
  Type:        Integer
  Required:    Yes
  Description: Width of the dataa, datab, and result ports.

Parameter: LPM_DIRECTION
  Type:        String
  Required:    No
  Description: Values are “ADD,” “SUB,” and “UNUSED.” If omitted, the
               default is “DEFAULT,” which directs the parameter to take
               its value from the add_sub port. The add_sub port cannot
               be used if LPM_DIRECTION is used. Altera recommends that
               you use the LPM_DIRECTION parameter to specify the
               operation of the lpm_add_sub function, rather than
               assigning a constant to the add_sub port.

Parameter: LPM_REPRESENTATION
  Type:        String
  Required:    No
  Description: Type of addition performed: “SIGNED,” “UNSIGNED,” or
               “UNUSED.” If omitted, the default is “SIGNED.”

Parameter: LPM_PIPELINE
  Type:        Integer
  Required:    No
  Description: Specifies the number of clock cycles of latency associated
               with the result output. A value of zero (0) indicates that
               no latency exists, and that a purely combinatorial
               function will be instantiated. If omitted, the default is
               0 (nonpipelined).

Parameter: LPM_HINT
  Type:        String
  Required:    No
  Description: Allows you to specify Altera-specific parameters in VHDL
               design files. The default is “UNUSED.”

Parameter: LPM_TYPE
  Type:        String
  Required:    No
  Description: Identifies the LPM entity name in VHDL design files.

Parameter: ONE_INPUT_IS_CONSTANT
  Type:        String
  Required:    No
  Description: Altera-specific parameter. Values are “YES,” “NO,” and
               “UNUSED.” Provides greater optimization, if one input is
               constant. If omitted, the default is “NO.”

Parameter: MAXIMIZE_SPEED
  Type:        Integer
  Required:    No
  Description: Altera-specific parameter. You can specify a value between
               0 and 10. If used, MaxPlusII attempts to optimize a
               specific instance of the lpm_add_sub function for speed
               rather than area, and overrides the setting of the
               Optimize option in the Global Project Logic Synthesis
               dialog box (Assign menu). If MAXIMIZE_SPEED is unused, the
               value of the Optimize option is used instead. If the
               setting for MAXIMIZE_SPEED is 6 or higher, the compiler
               will optimize lpm_add_sub megafunctions for higher speed;
               if the setting is 5 or less, the compiler will optimize
               for smaller area.

Note that for Verilog LPM 220 synthesizable code (i.e., 220model.v) the
following parameter ordering applies: lpm_type, lpm_width, lpm_direction,
lpm_representation, lpm_pipeline, lpm_hint.



Function 

The following table is an example of the UNSIGNED behavior in
lpm_add_sub:

        Inputs                      Outputs
add_sub  dataa  datab     cout, result     overflow
   1       a      b       a + b + cin      cout
   0       a      b       a - b - cin      !cout

The following table is an example of the SIGNED behavior in lpm_add_sub:

        Inputs                      Outputs
add_sub  dataa  datab     cout, sum        overflow
   1       a      b       a + b + cin      (a >= 0 and b >= 0 and sum < 0)
                                           or (a < 0 and b < 0 and sum >= 0)
   0       a      b       a - b - cin      (a >= 0 and b < 0 and sum < 0)
                                           or (a < 0 and b >= 0 and sum >= 0)
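
The port description gives the signed overflow flag a physical interpretation: the XOR of the carry-in to the MSB with the carry-out of the MSB. A minimal combinational sketch of that rule follows. The module name addsub_model and the parameter WIDTH are hypothetical, subtraction is assumed to be implemented as dataa + ~datab + cin (which matches the cin convention “SUB: low = -1, high = 0”), cout is the raw carry of that addition, and pipelining is ignored; this is not the Altera lpm_add_sub source.

```verilog
// Sketch of a two's complement adder/subtractor with the overflow
// rule "carry into MSB XOR carry out of MSB" (assumed model).
module addsub_model (dataa, datab, cin, add_sub, result, cout, overflow);
  parameter WIDTH = 8;                 // stands in for LPM_WIDTH, >= 2

  input  [WIDTH-1:0] dataa, datab;
  input              cin, add_sub;     // add_sub = 1: add, 0: subtract
  output [WIDTH-1:0] result;
  output             cout, overflow;

  // Subtract by adding the one's complement; cin supplies the +1
  wire [WIDTH-1:0] b_eff = add_sub ? datab : ~datab;

  // Sum of the lower WIDTH-1 bits; its top bit is the carry into the MSB
  wire [WIDTH-1:0] low_sum = {1'b0, dataa[WIDTH-2:0]}
                           + {1'b0, b_eff[WIDTH-2:0]} + cin;

  // Full sum with one extra bit for the carry out of the MSB
  wire [WIDTH:0]   full = {1'b0, dataa} + {1'b0, b_eff} + cin;

  assign result   = full[WIDTH-1:0];
  assign cout     = full[WIDTH];
  assign overflow = low_sum[WIDTH-1] ^ cout;
endmodule
```

For WIDTH = 8, adding 127 and 1 produces a carry into the MSB but none out of it, so overflow is set, in agreement with the signed table (both operands non-negative, sum negative).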



Resource Usage 

The following table summarizes the resource usage for an lpm_add_sub mega- 
function used to implement a 16-bit unsigned adder with a carry-in input and 
a carry-out output. Logic cell usage scales linearly in proportion to adder 
width. 



         Design goals                      Design results
Device family           Optimization   LCs       Speed (ns)   Notes
FLEX 6K, 8K, and 10K    Routability    45        53           Speed for
                        Speed          18        17           EPF8282A-2
MAX 5K, 7K, and 9K      Routability    28 (22)   23           Speed for
                                                              EPM7128E-7

Numbers of shared expanders used are shown in parentheses.







B.2.3 The Parameterized Multiplier Megafunction 
(lpm_mult) 

Altera recommends that you use lpm_mult to replace all other types of
multiplier functions, including old-style multiplier macrofunctions. We have used
this megafunction for the designs fir_gen, p. 111, ccmul, p. 265, fir_lms,
p. 392, and fir6dlms, p. 399.

Altera recommends instantiating this function as described in “Creating 
a Custom Megafunction Variation with the MegaWizard Plug-In Manager.” 
The port names and order for Verilog HDL prototype are: 

module lpm_mult ( dataa, datab, sum, aclr, clock, 
result) ; 

The VHDL component declaration is shown below: 

COMPONENT lpm_mult
  GENERIC (LPM_WIDTHA: POSITIVE;
           LPM_WIDTHB: POSITIVE;
           LPM_WIDTHS: POSITIVE;
           LPM_WIDTHP: POSITIVE;
           LPM_REPRESENTATION: STRING := "UNSIGNED";
           LPM_PIPELINE: INTEGER := 0;
           LPM_TYPE: STRING := "L_MULT";
           LPM_HINT: STRING := "UNUSED");
  PORT (dataa : IN STD_LOGIC_VECTOR(LPM_WIDTHA-1 DOWNTO 0);
        datab : IN STD_LOGIC_VECTOR(LPM_WIDTHB-1 DOWNTO 0);
        aclr, clken, clock : IN STD_LOGIC := '0';
        sum : IN STD_LOGIC_VECTOR(LPM_WIDTHS-1 DOWNTO 0)
                                  := (OTHERS => '0');
        result: OUT STD_LOGIC_VECTOR(LPM_WIDTHP-1 DOWNTO 0)
       );
END COMPONENT;



Ports 

The following table displays all INPUT ports of lpm_mult:

Port name  Required  Description          Comments
dataa      Yes       Multiplicand         Input port LPM_WIDTHA wide.
datab      Yes       Multiplier           Input port LPM_WIDTHB wide.
sum        No        Partial sum          Input port LPM_WIDTHS wide.
clock      No        Clock for pipelined  The clock port provides pipelined operation
                     usage                for the lpm_mult function. For LPM_PIPELINE
                                          values other than 0 (default value), the
                                          clock port must be connected.
clken      No        Clock enable for     Available for VHDL only.
                     pipelined usage
aclr       No        Asynchronous clear   The pipeline initializes to an undefined (X)
                     for pipelined usage  logic level. The aclr port can be used at
                                          any time to reset the pipeline to all 0s,
                                          asynchronously to the clock signal.



The following table displays all OUTPUT ports of lpm_mult: 



Port name  Required  Description              Comments
result     Yes       result = dataa * datab   Output port LPM_WIDTHP wide. If LPM_WIDTHP
                     + sum. The product LSB   < max(LPM_WIDTHA + LPM_WIDTHB, LPM_WIDTHS)
                     is aligned with the      or (LPM_WIDTHA + LPM_WIDTHS), only the
                     sum LSB.                 LPM_WIDTHP MSBs are present.



Parameters 

The following table shows the parameters of the lpm_mult component:

Parameter            Type     Required  Description
LPM_WIDTHA           Integer  Yes       Width of the dataa port.
LPM_WIDTHB           Integer  Yes       Width of the datab port.
LPM_WIDTHP           Integer  Yes       Width of the result port.
LPM_WIDTHS           Integer  Yes       Width of the sum port. Required even
                                        if the sum port is not used.
LPM_REPRESENTATION   String   No        Type of multiplication performed:
                                        “SIGNED,” “UNSIGNED,” or “UNUSED.” If
                                        omitted, the default is “UNSIGNED.”
LPM_PIPELINE         Integer  No        Specifies the number of clock cycles of
                                        latency associated with the result output.
                                        A value of zero (0) indicates that no
                                        latency exists, and that a purely
                                        combinatorial function will be instantiated.
                                        If omitted, the default is 0 (nonpipelined).
LPM_HINT             String   No        Allows you to assign Altera-specific
                                        parameters in VHDL Design Files. The
                                        default is “UNUSED.”
LPM_TYPE             String   No        Identifies the LPM entity name in
                                        VHDL Design Files.
INPUT_A_IS_CONSTANT  String   No        Altera-specific parameter. Values are
                                        “YES,” “NO,” and “UNUSED.” If dataa is
                                        connected to a constant value, setting
                                        INPUT_A_IS_CONSTANT to “YES” optimizes
                                        the multiplier for resource usage and
                                        speed. If omitted, the default is “NO.”


INPUT_B_IS_CONSTANT  String   No        Altera-specific parameter. Values are
                                        “YES,” “NO,” and “UNUSED.” If datab is
                                        connected to a constant value, setting
                                        INPUT_B_IS_CONSTANT to “YES” optimizes
                                        the multiplier for resource usage and
                                        speed. The default is “NO.”
USE_EAB              String   No        Altera-specific parameter. Values are “ON,”
                                        “OFF,” and “UNUSED.” Setting the USE_EAB
                                        parameter to “ON” allows MaxPlusII to use
                                        EABs to implement 4 x 4 or (8 x constant
                                        value) building blocks in FLEX 10K devices.
                                        Altera recommends that you set USE_EAB to
                                        “ON” only when LCELLS are in short supply.
                                        If you wish to use this parameter, when you
                                        instantiate the function in a GDF, you must
                                        specify it by entering the parameter name
                                        and value manually with the Edit
                                        Ports/Parameters dialog box (Symbol menu).
                                        You can also use this parameter name in a
                                        TDF or a Verilog design file. You must use
                                        the LPM_HINT parameter to specify the
                                        USE_EAB parameter in VHDL design files.
LATENCY              Integer  No        Altera-specific parameter. Same as
                                        LPM_PIPELINE. (This parameter is provided
                                        only for backward compatibility with
                                        MaxPlusII pre-version 7.0 designs. For all
                                        new designs, you should use the
                                        LPM_PIPELINE parameter instead.)
MAXIMIZE_SPEED       Integer  No        Altera-specific parameter. You can specify
                                        a value between 0 and 10. If used,
                                        MaxPlusII attempts to optimize a specific
                                        instance of the lpm_mult function for speed
                                        rather than area, and overrides the setting
                                        of the Optimize option in the Global
                                        Project Logic Synthesis dialog box (Assign
                                        menu). If MAXIMIZE_SPEED is unused, the
                                        value of the Optimize option is used
                                        instead. If the setting for MAXIMIZE_SPEED
                                        is 6 or higher, the compiler will optimize
                                        lpm_mult megafunctions for higher speed; if
                                        the setting is 5 or less, the compiler will
                                        optimize for smaller area.
LPM_HINT             String   No        Allows you to specify Altera-specific
                                        parameters in VHDL Design Files. The
                                        default is “UNUSED.”



Note that specifying a value for MAXIMIZE_SPEED has an effect only if
LPM_REPRESENTATION is set to “SIGNED.”

Note that for Verilog LPM 220 synthesizable code (i.e., 220model.v) the 
following parameter ordering applies: lpm_type, lpm_widtha, lpm_widthb, 
lpm_widths, lpm_widthp, lpm_representation, lpm_pipeline, lpm_hint. 



Function 

The following table is an example of the UNSIGNED behavior in lpm_mult:

Inputs                  Outputs
dataa   datab   sum     product
a       b       s       LPM_WIDTHP most significant bits of a * b + s
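The UNSIGNED behavior can be tried out with a small software model. The following Python sketch is not Altera's simulation model; the function name lpm_mult_unsigned and its argument names are hypothetical. It assumes that the full result is max(LPM_WIDTHA + LPM_WIDTHB, LPM_WIDTHS) bits wide, that any carry beyond this width is dropped, and that a narrower LPM_WIDTHP keeps only the most significant bits, as stated in the output port table.

```python
def lpm_mult_unsigned(a, b, widtha, widthb, widthp, s=0, widths=1):
    """Behavioral sketch of an UNSIGNED lpm_mult (hypothetical helper)."""
    if not (0 <= a < (1 << widtha) and 0 <= b < (1 << widthb)
            and 0 <= s < (1 << widths)):
        raise ValueError("operand does not fit its declared width")
    # Assumed full result width, taken from the output port description.
    full_width = max(widtha + widthb, widths)
    # Assumption: a carry out of the full width is discarded.
    full = (a * b + s) & ((1 << full_width) - 1)
    if widthp < full_width:
        # Only the LPM_WIDTHP most significant bits are present.
        return full >> (full_width - widthp)
    return full


print(lpm_mult_unsigned(15, 15, 4, 4, 8))                 # full 8-bit product
print(lpm_mult_unsigned(15, 15, 4, 4, 4))                 # upper 4 bits only
print(lpm_mult_unsigned(3, 5, 4, 4, 8, s=2, widths=8))    # with partial sum
```

For a 4 x 4-bit multiplier, 15 * 15 = 225 fits into LPM_WIDTHP = 8; with LPM_WIDTHP = 4 only the upper four bits of 225 remain, i.e., 14.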



Resource Usage 

The following table summarizes the resource usage for an lpm_mult function 
used to implement 4-bit and 8-bit multipliers with LPM_PIPELINE = 0 and
without the optional sum input. Logic cell usage scales in proportion
to the square of the input width.



Design goals                                  Design results
Device family          Optimization   Width   LCs       Speed (ns)   Notes
FLEX 6K, 8K, and 10K   Routability    8       121       80           Speed for EPF8282A-2
                       Speed          8       163       52
FLEX 6K, 8K, and 10K   Routability    4       29        34           Speed for EPF8282A-2
                       Speed          4       41        27
MAX 5K, 7K, and 9K     Routability    4       26 (11)   23           Speed for EPM7128E-7
                       Speed          4       27 (4)    19



Numbers of shared expanders used are shown in parentheses. In the FLEX
10K device family, the 4-bit by 4-bit multiplier example shown above can be
implemented in a single EAB.
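The quadratic growth of the logic cell count can be checked against the FLEX rows of the table. The short Python sketch below only reuses the LC numbers listed above; the dictionary and function names are hypothetical.

```python
# LC counts copied from the FLEX rows of the resource usage table.
lcs = {
    "Routability": {4: 29, 8: 121},
    "Speed": {4: 41, 8: 163},
}


def growth_ratio(counts, narrow=4, wide=8):
    """Ratio of LC counts when the input width grows from narrow to wide."""
    return counts[wide] / counts[narrow]


predicted = (8 / 4) ** 2  # a purely quadratic model predicts a factor of 4
for goal, counts in lcs.items():
    print(f"{goal}: measured x{growth_ratio(counts):.2f}, "
          f"quadratic model x{predicted:.2f}")
```

Doubling the width from 4 to 8 bits increases the LC count by factors of about 4.17 (routability) and 3.98 (speed), which is close to the factor of 4 expected from a quadratic law.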



B.2.4 The Parameterized ROM Megafunction (lpm_rom)

Altera recommends that you use the lpm_rom function to implement all ROM
functions. The lpm_rom function is available only for FLEX 10K devices. We 
have used this megafunction for the designs fun_text, p. 23 and darom, 
p. 135. The MaxPlusII compiler automatically implements suitable portions 
of this function in EABs in FLEX 10K devices. Therefore, it is not necessary 
to use the “Implement in EAB” logic option for this function, and doing so 
may cause warning messages to appear. 

Altera recommends instantiating this function as described in “Creating 
a Custom Megafunction Variation with the MegaWizard Plug-In Manager.” 
You can use the genmem.exe utility to create a simulation model for this 
function for use in third-party simulators. Type genmem -h at a DOS prompt 
for information on how to use this utility. 

The port names and order for Verilog HDL prototype are: 

module lpm_rom (address, inclock, outclock, memenab, q);

The VHDL component declaration is shown below:

COMPONENT lpm_rom
  GENERIC (LPM_WIDTH           : POSITIVE;
           LPM_TYPE            : STRING := "L_ROM";
           LPM_WIDTHAD         : POSITIVE;
           LPM_NUMWORDS        : POSITIVE;
           LPM_FILE            : STRING;
           LPM_ADDRESS_CONTROL : STRING := "REGISTERED";
           LPM_OUTDATA         : STRING := "REGISTERED";
           LPM_HINT            : STRING := "UNUSED");
  PORT (address  : IN STD_LOGIC_VECTOR(LPM_WIDTHAD-1 DOWNTO 0);
        inclock  : IN STD_LOGIC := '1';
        outclock : IN STD_LOGIC := '1';
        memenab  : IN STD_LOGIC := '1';
        q        : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0)
       );
END COMPONENT;



Ports 



The following table displays all INPUT ports of lpm_rom:

Port name  Required  Description         Comments
address    Yes       Address input to    Input port LPM_WIDTHAD wide.
                     the memory
inclock    No        Clock for input     The address port is synchronous (registered)
                     registers           when the inclock port is connected, and is
                                         asynchronous (unregistered) when the inclock
                                         port is not connected.
outclock   No        Clock for output    The addressed memory content-to-q response
                     registers           is synchronous when the outclock port is
                                         connected, and is asynchronous when it is
                                         not connected.
memenab    No        Memory enable       High = data output on q, Low = high-impedance
                     input               outputs.



The following table displays all OUTPUT ports of lpm_rom:

Port name  Required  Description        Comments
q          Yes       Output of memory   Output port LPM_WIDTH wide.



Parameters 

The following table shows the parameters of the lpm_rom component:

Parameter            Type     Required  Description
LPM_WIDTH            Integer  Yes       Width of the q port.
LPM_WIDTHAD          Integer  Yes       Width of the address port. LPM_WIDTHAD
                                        should be (but is not required to be) equal
                                        to log2(LPM_NUMWORDS). If LPM_WIDTHAD is
                                        too small, some memory locations will not
                                        be addressable. If it is too large, the
                                        addresses that are too high will return
                                        undefined logic levels.
LPM_NUMWORDS         Integer  Yes       Number of words stored in memory. In
                                        general, this value should be (but is not
                                        required to be) 2^(LPM_WIDTHAD-1) <
                                        LPM_NUMWORDS <= 2^LPM_WIDTHAD. If omitted,
                                        the default is 2^LPM_WIDTHAD.
LPM_FILE             String   No        Name of the Memory Initialization File
                                        (*.mif) or Hexadecimal (Intel-Format) File
                                        (*.hex) containing ROM initialization data
                                        (“<filename>”), or “UNUSED.”
LPM_ADDRESS_CONTROL  String   No        Values are “REGISTERED,” “UNREGISTERED,”
                                        and “UNUSED.” Indicates whether the address
                                        port is registered. If omitted, the default
                                        is “REGISTERED.”
LPM_OUTDATA          String   No        Values are “REGISTERED,” “UNREGISTERED,”
                                        and “UNUSED.” Indicates whether the q and
                                        eq ports are registered. If omitted, the
                                        default is “REGISTERED.”
LPM_HINT             String   No        Allows you to specify Altera-specific
                                        parameters in VHDL Design Files. The
                                        default is “UNUSED.”
LPM_TYPE             String   No        Identifies the LPM entity name in
                                        VHDL Design Files.



Note that for Verilog LPM 220 synthesizable code (i.e., 220model.v) the follow-
ing parameter ordering applies: lpm_type, lpm_width, lpm_widthad, lpm_numwords,
lpm_address_control, lpm_outdata, lpm_file, lpm_hint.
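The relation between LPM_WIDTHAD and LPM_NUMWORDS given in the parameter table can be checked before a design is compiled. The following Python sketch is only an illustration; the helper names rom_address_width and check_rom_parameters are hypothetical and not part of any Altera tool.

```python
def rom_address_width(numwords):
    """Smallest address width that can reach all numwords locations."""
    if numwords < 1:
        raise ValueError("LPM_NUMWORDS must be positive")
    # (numwords - 1).bit_length() equals ceil(log2(numwords)) for numwords > 1;
    # the width is kept at 1 or more because LPM_WIDTHAD is of type POSITIVE.
    return max(1, (numwords - 1).bit_length())


def check_rom_parameters(widthad, numwords):
    """Classify a (LPM_WIDTHAD, LPM_NUMWORDS) pair using the table's rule."""
    if numwords > 2 ** widthad:
        return "LPM_WIDTHAD too small: some locations are not addressable"
    if numwords <= 2 ** (widthad - 1):
        return "LPM_WIDTHAD too large: high addresses are undefined"
    return "ok"


print(rom_address_width(256))        # width needed for a 256-word ROM
print(check_rom_parameters(8, 256))
print(check_rom_parameters(7, 256))
print(check_rom_parameters(9, 256))
```

For a table with 256 words, an 8-bit address satisfies 2^7 < 256 <= 2^8, while 7 bits are too few and 9 bits leave the upper half of the address range undefined.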



Function 

The following table shows the synchronous read from memory behavior of lpm_rom:

OUTCLOCK         MEMENAB  Function
X                L        q output is high impedance (memory not enabled).
Not rising edge  H        No change in output.
Rising edge      H        The output register is loaded with the contents of
                          the memory location pointed to by address. q outputs
                          the contents of the output register.





Totally asynchronous memory operations occur when neither inclock nor
outclock is connected. The output q is asynchronous and reflects the data in the
memory to which address points. The following table shows the asynchronous 
memory operations behavior of lpm_rom: 



MEMENAB Function 

L q output is high-impedance (memory not enabled) 

H The memory location pointed to by address is read 
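The two read modes can be compared with a toy software model. The Python sketch below is not the Altera simulation model; the class RomModel and its method names are hypothetical. It assumes that the output register is loaded only on a rising outclock edge while memenab is high, and it represents a high-impedance output by the string "Z".

```python
class RomModel:
    """Toy model of the lpm_rom read behavior (hypothetical helper)."""

    def __init__(self, words):
        self.words = list(words)
        self.out_reg = None  # undefined until the first rising outclock edge

    def read_async(self, address, memenab=True):
        """No clocks connected: q follows the addressed memory word."""
        if not memenab:
            return "Z"  # memory not enabled
        return self.words[address]

    def outclock_rising(self, address, memenab=True):
        """Rising outclock edge: load the output register if enabled."""
        if memenab:  # assumption: no load while memenab is low
            self.out_reg = self.words[address]

    def q_sync(self, memenab=True):
        """Registered output; unchanged between rising clock edges."""
        return self.out_reg if memenab else "Z"


rom = RomModel([10, 20, 30, 40])
print(rom.read_async(2))             # asynchronous read of address 2
rom.outclock_rising(1)               # synchronous read of address 1
print(rom.q_sync())                  # register content after the clock edge
print(rom.q_sync(memenab=False))     # output disabled
```

In the asynchronous mode the addressed word appears at q directly, while in the synchronous mode q changes only after a rising edge of outclock.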



Resource Usage 

The Megafunction lpm_rom uses one embedded cell per memory bit. 



B.2.5 The Parameterized Divider Megafunction 
(lpm_divide) 

Altera recommends that you use lpm_divide to replace all other types of divider 
functions, including old-style divide macrofunction. We have used this megafunction 
for the array divider designs p. 77. 

Altera recommends instantiating this function as described in “Creating a Cus- 
tom Megafunction Variation with the MegaWizard Plug-In Manager.” 

The port names and order for Verilog HDL prototype are: 

module lpm_divide (quotient, remain, numer, denom,
                   clock, clken, aclr);

The VHDL component declaration is shown below:

COMPONENT lpm_divide
  GENERIC (LPM_WIDTHN          : POSITIVE;
           LPM_WIDTHD          : POSITIVE;
           LPM_NREPRESENTATION : STRING := "UNSIGNED";
           LPM_DREPRESENTATION : STRING := "UNSIGNED";
           LPM_TYPE            : STRING := "LPM_DIVIDE";
           LPM_PIPELINE        : INTEGER := 0;
           LPM_HINT            : STRING := "UNUSED"
          );
  PORT (numer    : IN STD_LOGIC_VECTOR(LPM_WIDTHN-1 DOWNTO 0);
        denom    : IN STD_LOGIC_VECTOR(LPM_WIDTHD-1 DOWNTO 0);
        clock, aclr : IN STD_LOGIC := '0';
        clken    : IN STD_LOGIC := '1';
        quotient : OUT STD_LOGIC_VECTOR(LPM_WIDTHN-1 DOWNTO 0);
        remain   : OUT STD_LOGIC_VECTOR(LPM_WIDTHD-1 DOWNTO 0)
       );
END COMPONENT;







Ports 



The following table displays all INPUT ports of lpm_divide:

Port name  Required  Description           Comments
numer      Yes       Numerator             Input port LPM_WIDTHN wide.
denom      Yes       Denominator           Input port LPM_WIDTHD wide.
clock      No        Clock input for       You must connect the clock input if you
                     pipelined usage.      set LPM_PIPELINE to a value other than 0.
clken      No        Clock enable for
                     pipelined usage.
aclr       No        Asynchronous clear    The aclr port may be used at any time to
                     signal.               reset the pipeline to all 0s asynchronously
                                           to the clock input.



The following table displays all OUTPUT ports of lpm_divide:

Port name  Required  Description        Comments
quotient   Yes       Output port        You must use either the quotient or the
                     LPM_WIDTHN wide.   remain ports.
remain     Yes       Output port        You must use either the quotient or the
                     LPM_WIDTHD wide.   remain ports.



Parameters 



The following table shows the parameters of the lpm_divide component: 







Parameter            Type     Required  Description
LPM_WIDTHN           Integer  Yes       Width of the numer and quotient port.
LPM_WIDTHD           Integer  Yes       Width of the denom and remain port.
LPM_NREPRESENTATION  String   No        Specifies whether the numerator is
                                        “SIGNED” or “UNSIGNED”. Only “UNSIGNED”
                                        is supported for now.
LPM_DREPRESENTATION  String   No        Specifies whether the denominator is
                                        “SIGNED” or “UNSIGNED”. Only “UNSIGNED”
                                        is supported for now.
LPM_PIPELINE         Integer  No        Specifies the number of clock cycles of
                                        latency associated with the quotient and
                                        remain outputs. A value of zero (0)
                                        indicates that no latency exists, and that
                                        a purely combinatorial function will be
                                        instantiated. If omitted, the default is 0
                                        (nonpipelined). You cannot specify a value
                                        for the LPM_PIPELINE parameter that is
                                        higher than LPM_WIDTHN.
LPM_TYPE             String   No        Identifies the LPM entity name in
                                        VHDL Design Files.
LPM_HINT             String   No        Allows you to assign Altera-specific
                                        parameters in VHDL Design Files. The
                                        default is “UNUSED.”



You can pipeline a design by connecting the clock input and specifying the 
number of Clock cycles of latency with the LPM_PIPELINE parameter. 

Note that for Verilog LPM 220 synthesizable code (i.e., 220model.v)
the following parameter ordering applies: lpm_type, lpm_widthn, lpm_widthd,
lpm_nrepresentation, lpm_drepresentation, lpm_pipeline.
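The port widths and the pipeline limit of lpm_divide can be illustrated with a small software model. The Python sketch below covers the UNSIGNED case only; the function name lpm_divide_unsigned is hypothetical, and the behavior for a zero denominator is not described above and is therefore rejected rather than modeled.

```python
def lpm_divide_unsigned(numer, denom, widthn, widthd, pipeline=0):
    """Behavioral sketch of an UNSIGNED lpm_divide (hypothetical helper)."""
    if not 0 <= pipeline <= widthn:
        raise ValueError("LPM_PIPELINE must not be higher than LPM_WIDTHN")
    if not 0 <= numer < (1 << widthn):
        raise ValueError("numer does not fit into LPM_WIDTHN bits")
    if not 0 < denom < (1 << widthd):
        raise ValueError("denom must be nonzero and fit into LPM_WIDTHD bits")
    # quotient <= numer fits into LPM_WIDTHN bits,
    # remain < denom fits into LPM_WIDTHD bits.
    quotient, remain = divmod(numer, denom)
    return quotient, remain


print(lpm_divide_unsigned(100, 7, widthn=8, widthd=4))
```

The pipeline argument only affects the latency in hardware, so the model checks its range but returns the same quotient and remainder for every legal value.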




C. Glossary 



ACC     Accumulator
ACT     Actel FPGA family
ADC     Analog-to-digital converter
ADCL    All-digital CL
ADF     Adaptive digital filter
ADPCM   Adaptive differential pulse code modulation
ADPLL   All-digital PLL
ADSP    Analog Devices digital signal processor family
AFT     Arithmetic Fourier transform
AHDL    Altera HDL
AM      Amplitude modulation
ALU     Arithmetic logic unit
AMD     Advanced Micro Devices, Inc.
ASCII   American Standard Code for Information Interchange
ASIC    Application specific IC
AWGN    Additive white Gaussian noise

BDD     Binary decision diagram
BLMS    Block LMS
BP      Bandpass
BRS     Base removal scaling
BS      Barrelshifter

CAE     Computer-aided engineering
CAST    Carlisle Adams and Stafford Tavares
CBC     Cipher block chaining
CBIC    Cell-based IC
CD      Compact disc
CFA     Common factor algorithm
CFB     Cipher feedback
CIC     Cascaded integrator comb
CL      Costas loop
CLB     Configurable logic block
CMOS    Complementary metal oxide semiconductor
CODEC   Coder/decoder
CORDIC  Coordinate rotation digital computer
COTS    Commercial off-the-shelf technology
CPLD    Complex PLD
CPU     Central processing unit


CQF     Conjugate quadrature filter
CRNS    Complex RNS
CRT     Chinese remainder theorem
CSOC    Canonical self-orthogonal code
CSD     Canonical signed digit
CWT     Continuous wavelet transform
CZT     Chirp-z transform

DA      Distributed arithmetic
DAC     Digital-to-analog converter
DB      Daubechies filter
DC      Direct current
DCT     Discrete cosine transform
DCO     Digital controlled oscillator
DES     Data encryption standard
DFT     Discrete Fourier transform
DHT     Discrete Hartley transform
DIF     Decimation in frequency
DIT     Decimation in time
DLMS    Delayed LMS
DMT     Discrete Morlet transform
DPLL    Digital PLL
DSP     Digital signal processing
DST     Discrete sine transform
DWT     Discrete wavelet transform

EAB     Embedded array block
ECB     Electronic code book
ECL     Emitter coupled logic
EDIF    Electronic design interchange format
EFF     Electronic Frontier Foundation
EPF     Altera FPGA family
EPROM   Electrically programmable ROM
ERA     Plessey FPGA family
ERNS    Eisenstein RNS
ESA     European Space Agency
EVR     Eigenvalue ratio

FAEST   Fast a posteriori error sequential technique
FCT     Fast cosine transform
FC2     FPGA compiler II
FF      Flip-flop
FFT     Fast Fourier transform
FIR     Finite impulse response
FIFO    First-in first-out
FLEX    Altera FPGA family
FM      Frequency modulation
FNT     Fermat NTT
FPGA    Field-programmable gate array
FPL     Field-programmable logic (combines CPLD and FPGA)
FPLD    FPL device
FSF     Frequency sampling filter
FSK     Frequency shift keying
FSM     Finite state machine







GAL     Generic array logic
GF      Galois field

HB      Half-band filter
HF      High frequency
HDL     Hardware description language
HSP     Harris Semiconductor DSP ICs

IBM     International Business Machines (corporation)
IC      Integrated circuit
IDCT    Inverse DCT
IDEA    International data encryption algorithm
IDFT    Inverse discrete Fourier transform
IEEE    Institute of Electrical and Electronics Engineers
IF      Intermediate frequency
IFFT    Inverse fast Fourier transform
IIR     Infinite impulse response
INTT    Inverse NTT
ITU     International Telecommunication Union

JPEG    Joint Photographic Experts Group

KLT     Karhunen-Loeve transform

LAB     Logic array block
LAN     Local area network
LC      Logic cell
LE      Logic element
LF      Low frequency
LFSR    Linear feedback shift register
LMS     Least-mean-square
LNS     Logarithmic number system
LO      Low frequency
LP      Low pass
LPM     Library of parameterized modules
LRS     Serial left right shifter
LS      Least-square
LSB     Least significant bit
LSI     Large scale integration
LTI     Linear time-invariant
LUT     Look-up table

MAC     Multiplication and accumulate
MACH    AMD/Vantis FPGA family
MAG     Multiplier adder graph
MAX     Altera CPLD family
MIF     Memory initialization file
MLSE    Maximum likelihood sequence estimator
MNT     Mersenne NTT
MPEG    Moving Picture Experts Group
MPX     Multiplexer
MSPS    Millions of samples per second


MRC     Mixed radix conversion
MSB     Most significant bit
MUL     Multiplication

NLMS    Normalized LMS
NP      Nonpolynomial complex problem
NRE     Nonrecurring engineering costs
NTT     Number theoretic transform

OFB     Open feedback (mode)

PAM     Pulse-amplitude-modulated
PC      Personal computer
PD      Phase detector
PDSP    Programmable digital signal processor
PFA     Prime factor algorithm
PLA     Programmable logic array
PLD     Programmable logic device
PLL     Phase-locked loop
PM      Phase modulation
PREP    Programmable Electronic Performance (cooperation)
PRNS    Polynomial RNS
PROM    Programmable ROM
PSK     Phase shift keying

QDFT    Quantized DFT
QLI     Quick look-in
QFFT    Quantized FFT
QMF     Quadrature mirror filter
QRNS    Quadratic RNS

RAM     Random access memory
RC      Resistor/capacity
RF      Radio frequency
RISC    Reduced instruction set computer
RLS     Recursive least square
RNS     Residue number system
ROM     Read only memory
RPFA    Rader prime factor algorithm
RS      Serial right shifter
RSA     Rivest, Shamir, and Adleman

SD      Signed digit
SG      Stochastic gradient
SLMS    Signed LMS
SM      Signed magnitude
SNR     Signal-to-noise ratio
SPLD    Simple PLD
SPT     Signed power of two
SR      Shift register
SRAM    Static random access memory
STFT    Short term Fourier transform

TDLMS   Transform domain LMS
TLU     Table look-up
TMS     Texas Instruments DSP family
TI      Texas Instruments
TTL     Transistor transistor logic

UART    Universal asynchronous receiver/transmitter

VCO     Voltage-control oscillator
VHDL    VHSIC hardware description language
VHSIC   Very high speed integrated circuit
VLSI    Very large integrated ICs

WFTA    Winograd Fourier transform algorithm
WSS     Wide sense stationary

XC      Xilinx FPGA family
XNOR    Exclusive NOR gate




D. CD-ROM File “1readme.ps”



The accompanying CD-ROM includes 

• A full version of the MaxPlusII software 

• Altera’s digital library (Version September 2003) with many application notes, 
data sheets and manuals 

• All VHDL/Verilog design examples and utility programs and files.

For instance, you will find under literature/manual/81_gs.pdf the full “Getting
Started” manual, which includes detailed tutorials on MaxPlusII, AHDL, and the
graphical design entry. Click on index.html to get an overview of the full set
of Altera documentation. To install the MaxPlusII 10.2 student edition software,
first read the file ins-student.htm from Altera’s web page www.altera.com on
the CD-ROM. Then just start the self-extracting file student102.exe on the
CD-ROM. After the installation, the user must register the software through
Altera’s web page at www.altera.com in order to get a license key. If you are
not a student at a university, you may consider downloading Altera’s “baseline”
software, which then needs an additional VHDL design entry. You can download
the Leonardo software from Altera’s web page or use FC2 to produce an EDIF file
that you then compile with the MaxPlusII baseline package.

The design examples for the book are located in the directories book2e/vhdl and
book2e/verilog for VHDL and Verilog examples, respectively. These directories
contain, for each example, the following three files:

• The VHDL or Verilog source code (*.vhd and *.v)

• The Assignment and Configuration File (*.acf)

• The Simulator Channel File (*.scf)

For the design fun_graf, the “Graphic Design File” (*.gdf) is included in
book2e/vhdl. For the examples that utilize EABs (i.e., fun_text and darom), the
“Memory Initialization File” (*.mif) and the same file in Intel hex format (*.hex)
can be found on the CD-ROM. To simplify the compilation and postprocessing, the
source code directories include some additional (*.bat) files shown below:



File            Comment
maxplus.bat     Script to compile all design examples. Note that the
                installation path of MaxPlusII is assumed to be
                C:\maxplus10p2\maxplus2.exe. Change the path globally
                (if necessary) with an editor before running the script.
clean.bat       Cleans all temporary compiler files, but not the report
                files (*.rep) and timing analyzer output files (*.tao).
veryclean.bat   Cleans all temporary compiler files, including the report
                files (*.rep) and timing analyzer output files (*.tao).

Use maxplus.bat to compile all design examples and then clean.bat to remove
the unnecessary files. The results for all examples are summarized in the following
files under book2e/util.



File      Description
Lcs.vhd   Pins, memory usage and LC count for all VHDL examples,
          as shown in the report files, *.rpt.
Mhz.vhd   “Worst case” path name, delay, and Registered Performance
          in MHz for all VHDL examples, as shown in the timing
          analyzer output files, *.tao.
Lcs.v     Pins, memory usage and LC count for all Verilog examples,
          as shown in the report files, *.rpt.
Mhz.v     “Worst case” path name, delay, and Registered Performance
          in MHz for all Verilog examples, as shown in the timing
          analyzer output files, *.tao.



Using Compilers Other Than MaxPlusII

FPGA_CompilerII

The main advantage of using the FPGA_CompilerII (FC2) from Synopsys is that it
is now possible to synthesize examples for other devices like Xilinx, Vantis, Actel,
or QuickLogic. The TCL scripts vhdl.fc2 and verilog.fc2, respectively, provide
the necessary commands for the shell mode of FC2, i.e., fc2_shell.

Using FPGA_CompilerII and VHDL. The FPGA_CompilerII script 
vhdl.fc2 to compile the VHDL examples is shown in the following: 

#
# Synopsys FPGA Compiler II VHDL simulation script vhdl.fc2
# for the book: DSP with FPGAs (2. edition)
# Author-EMAIL: Uwe.Meyer-Baese@ieee.org
#
# Usage: fc2_shell -f vhdl.fc2
#
create_project -dir . fc2

#*** Chapter 1 entitys:
set ch1 "example fun_text"
#*** Chapter 2 components:
set ch2 "csa7 csa7cin add_ff8 add_ff8cin"
#*** Chapter 2 entitys:
set ch2e "add_1p add_2p add_3p mul_ser cordic"
#*** Chapter 3 components:
set ch3 "case3 case5p case3s"
#*** Chapter 3 entitys:
set ch3e "fir_gen fir_srg dafsm darom dasign dapara"
#*** Chapter 4 entitys:
set ch4 "iir iir_pipe iir_par"
#*** Chapter 5 entitys:
set ch5 "cic3r32 cic3s32 db4poly db4latti"
#*** Chapter 6 entitys:
set ch6 "rader7 ccmul bfproc"
#*** Chapter 7 entitys:
set ch7 "lfsr lfsr6s3 ammod"
#*** 2. edition entitys:
set e2 "div_res div_aegp fir_lms fir6dlms"

#*** Make a single list of all VHDL files
set files "$ch1 $ch2 $ch2e $ch3 $ch3e $ch4 $ch5 $ch6 $ch7 $e2"

#*** Compile all files in the list and report results
foreach current $files {
  add_file $current.vhd
  analyze_file
  create_chip -progress -name $current \
    -target FLEX10K -device EPF10K70RC240 -speed -4 \
    -frequency 30 $current
  current_chip $current
  optimize_chip -name $current-opt
  report_chip > $current.rpt
  export_chip -dir fc2
}
#
#*** You may select similar target devices from other
#*** vendors. Use the following options for the
#*** "create_chip" command:
# Xilinx: -target XC4000E -device 4013EPG223 -speed -4
# Vantis -target VF1 -device VF1020AMYTC -speed 1
# Actel: -target A1400 -device A14100BPRQ208 -speed STD
# QuickLogic:
# -target QLOGIC -device QL3040-PQ240 -speed -4
#
quit



The script produces a report file for each design that shows the device utiliza-
tion and estimated Registered Performance. Unfortunately, these speed data are
not very exact and it is better to import the three files produced by FC2, namely 

• The EDIF file (*.edf) 

• The Assignment and Configuration File (*.acf) 

• The Library Mapping File (*.lmf ) from Synopsys 

into MaxPlusII and compile the EDIF file in order to get exact results. 

Using FPGA_CompilerII and Verilog. Using the Verilog interface in com-
bination with FPGA_CompilerII requires some additional effort, because FC2 does
not support the definition of the parameter of a component using defparam. There
are, in general, two ways to alter parameter values: the defparam statement, which
allows assignment to parameters using their hierarchical names (supported by Max-
PlusII, but not by FC2), and the module instance parameter value assignment, which
allows values to be assigned inline during module instantiation (supported by FC2,
but not by MaxPlusII).

In MaxPlusII we have used the defparam statement in, for instance, the design
fun_text.v, p. 436:

lpm_rom rom1
  ( .q(sin), .inclock(clk), .outclock(clk),
    .address(msbs));          // Used ports
//  .memenab(ena));           // Unused port

defparam rom1.lpm_width = 8;
defparam rom1.lpm_widthad = 8;
defparam rom1.lpm_file = "sine.mif";

For FC2 this has to be coded as

lpm_rom
  #(8, 8, 256, "UNREGISTERED", "UNREGISTERED", "sine.hex") rom1
  ( .q(sin), .inclock(clk), .outclock(clk),
    .address(msbs),           // Used ports
    .memenab(ena));           // Unused port

The disadvantage with this coding is that we must be very careful with the order 
of the parameters. Note also that the first parameter for FC2 is not lpm_type, as 
defined in the Verilog LPM 220 synthesizable code, i.e., 220model.v (see p. 494).
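One way to reduce the risk of a wrong parameter order is to generate the positional list from named values. The following Python sketch is only an illustration and not part of the CD-ROM utilities; the names FC2_LPM_ROM_ORDER, verilog_literal, and positional_parameters are hypothetical, and the order is the one read off the FC2 example above (the 220model.v order without lpm_type and lpm_hint).

```python
# Positional order assumed from the FC2 lpm_rom example above.
FC2_LPM_ROM_ORDER = (
    "lpm_width", "lpm_widthad", "lpm_numwords",
    "lpm_address_control", "lpm_outdata", "lpm_file",
)


def verilog_literal(value):
    """Format a Python value as a Verilog parameter literal."""
    return f'"{value}"' if isinstance(value, str) else str(value)


def positional_parameters(params, order=FC2_LPM_ROM_ORDER):
    """Build the '#(...)' list; reject missing or unknown parameter names."""
    missing = [name for name in order if name not in params]
    unknown = [name for name in params if name not in order]
    if missing or unknown:
        raise KeyError(f"missing: {missing}, unknown: {unknown}")
    return "#(" + ", ".join(verilog_literal(params[n]) for n in order) + ")"


rom_params = {
    "lpm_file": "sine.hex",          # any key order is accepted here
    "lpm_width": 8,
    "lpm_widthad": 8,
    "lpm_numwords": 256,
    "lpm_address_control": "UNREGISTERED",
    "lpm_outdata": "UNREGISTERED",
}
print(positional_parameters(rom_params))
```

Because the values are looked up by name, a misspelled or forgotten parameter raises an error instead of silently shifting all following values by one position.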

The modified design files can be found in the directory book2e/verilog/fc2
along with the TCL script verilog.fc2 to be used with fc2_shell.



Model Technology 

By using the synthesizable public domain models provided by the EDIF organiza-
tion (at www.edif.org), it is also possible to use other VHDL/Verilog simulators
than MaxPlusII.

Using MTI and VHDL. For VHDL, the two files 220pack.vhd and
220model.vhd must first be compiled. For the ModelSim simulator vsim from Model
Technology Inc., the script mti_vhdl.do can be used for a device-independent com-
pilation and simulation of the design examples. The script is shown below:

#
# Model Technology VHDL compiler script for the book
# Digital Signal Processing with FPGAs (2. edition)
# Author-EMAIL: Uwe.Meyer-Baese@ieee.org
#
echo Create Library directory lpm
vlib lpm

echo Compile lpm package.
vcom -work lpm -explicit -quiet 220pack.vhd 220model.vhd

echo Compile chapter 1 entitys.
vcom -work lpm -quiet example.vhd fun_text.vhd

echo Compile chapter 2 components.
vcom -work lpm -explicit -quiet csa7.vhd csa7cin.vhd
vcom -work lpm -explicit -quiet add_ff8.vhd add_ff8cin.vhd

echo Compile chapter 2 entitys.
vcom -work lpm -explicit -quiet add_1p.vhd add_2p.vhd
vcom -work lpm -explicit -quiet add_3p.vhd mul_ser.vhd
vcom -work lpm -explicit -quiet cordic.vhd

echo Compile chapter 3 components.
vcom -work lpm -explicit -quiet case3.vhd case5p.vhd
vcom -work lpm -explicit -quiet case3s.vhd

echo Compile chapter 3 entitys.
vcom -work lpm -explicit -quiet fir_gen.vhd fir_srg.vhd
vcom -work lpm -explicit -quiet dafsm.vhd darom.vhd
vcom -work lpm -explicit -quiet dasign.vhd dapara.vhd

echo Compile chapter 4 entitys.
vcom -work lpm -explicit -quiet iir.vhd iir_pipe.vhd
vcom -work lpm -explicit -quiet iir_par.vhd

echo Compile chapter 5 entitys.
vcom -work lpm -explicit -quiet cic3r32.vhd cic3s32.vhd
vcom -work lpm -explicit -quiet db4poly.vhd db4latti.vhd

echo Compile chapter 6 entitys.
vcom -work lpm -explicit -quiet rader7.vhd ccmul.vhd
vcom -work lpm -explicit -quiet bfproc.vhd

echo Compile chapter 7 entitys.
vcom -work lpm -explicit -quiet lfsr.vhd lfsr6s3.vhd
vcom -work lpm -explicit -quiet ammod.vhd

echo Compile 2. edition entitys.
vcom -work lpm -explicit -quiet div_res.vhd div_aegp.vhd
vcom -work lpm -explicit -quiet fir_lms.vhd fir6dlms.vhd

Start the ModelSim simulator and then type 

do mti_vhdl.do 

to execute the script. 

Using MTI and Verilog. Using the Verilog interface with the lpm library from
EDIF, i.e., 220model.v, needs some additional effort. When using 220model.v, it
is necessary to specify all ports in the Verilog lpm components. There is an extra
directory, book2e/verilog/mti, that provides the design examples with a full set
of lpm port specifications. The designs use

`include "220model.v"

at the beginning of each Verilog file to include the lpm components, if necessary.
Use the script mti_v.csh to compile all Verilog design examples with Model Tech-
nology's vlog compiler.

In order to load the “Memory Initialization File” (*.mif), you need to be
familiar with the “Programming Language Interface” (PLI) of the IEEE Verilog
standard 1364-1995 (see LRM Sec. 17, p. 228 ff). With the PLI, conventional
C programs can be dynamically loaded into the Verilog compiler. To generate a
dynamically loadable object from the program convert_hex2ver.c, the path to
the include files veriuser.h and acc_user.h must be specified. Use the -I option
when compiling with gcc or cc under SUN Solaris. With gcc under SUN Solaris
and the Model Technology compiler, for instance, the following commands
produce the shared object:

gcc -c -I/<install_dir>/modeltech/include convert_hex2ver.c
ld -G -B symbolic -o convert_hex2ver.sl convert_hex2ver.o

By doing so, ld will generate a warning “Symbol referencing errors,” because all
symbols are first resolved within the shared library at link time, but these warnings
can be ignored.

It is then possible to use these shared objects, for instance with Model
Technology's vsim in the first design example fun_text.v, with

vsim -pli convert_hex2ver.sl lpm.fun_text

To learn more about the PLI, consult the IEEE Verilog standard 1364-1995 or the
user's manual of your Verilog compiler.

We can use the script mti_v.csh in order to compile all Verilog examples with
MTI’s vlog. But vlog does not perform a check of the correct component port 
instantiations or shared objects. A second script, mti_v.do, can be used for this 
purpose. Start the vsim simulator (without loading a design) and execute the “DO” 
file with 

do mti_v.do 

to perform the check for all designs. 



Utility Programs and Files 

A couple of extra utility programs are also included on the CD-ROM and can be 
found in the directory book2e/util. These are the following programs: 



File         Description

sine.exe     Program to generate the MIF files for the function generator in Chap. 1.
csd.exe      Program to find the “Canonical Signed Digit” representation of integers or fractions as used in Chap. 2.
fpinv.exe    Program to compute the floating-point tables for reciprocals as used in Chap. 2.
dagen.exe    Program to generate VHDL code for the distributed arithmetic files used in Chap. 3.
cic.exe      Program to compute the parameters for a CIC filter as used in Chap. 5.
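To illustrate the kind of computation csd.exe performs, here is a minimal Python sketch of the standard integer-to-CSD conversion. It is not the author's program: the function names `to_csd` and `csd_string` are hypothetical, the letter N for the digit -1 is an assumed notation, and only integers (not fractions) are handled.

```python
def to_csd(n):
    """Return the CSD digits of integer n, least significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    while n != 0:
        if n % 2:                # odd: emit a nonzero digit
            d = 2 - (n % 4)      # n % 4 == 1 gives +1, n % 4 == 3 gives -1
            n -= d               # n is now a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits or [0]


def csd_string(n):
    """Format the digits most significant first, writing -1 as 'N' (assumed notation)."""
    symbols = {1: "1", 0: "0", -1: "N"}
    return "".join(symbols[d] for d in reversed(to_csd(n)))
```

For example, `csd_string(7)` returns `"100N"`, i.e., 7 = 8 - 1, which needs two nonzero digits instead of the three in the binary form 111.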



The programs are compiled using the author’s MS Visual C++ “Standard Edition”
software (available for $50-100 at all major retailers) for DOS window
applications and should therefore run on Windows 95 or higher. The DOS script
Testall.bat produces the examples used in the book.

Also under book2e/util we find the following utility files: 
File            Description

quickver.pdf    Quick reference card for Verilog HDL from QUALIS
quickvhd.pdf    Quick reference card for VHDL from QUALIS
quicklog.pdf    Quick reference card for IEEE 1164 logic package from QUALIS
93vhdl.vhd      The IEEE VHDL 1076-1993 keywords
95key.v         The IEEE Verilog 1364-1995 keywords
95direct.v      The IEEE Verilog 1364-1995 compiler directives
95tasks.v       The IEEE Verilog 1364-1995 system tasks and functions


In addition, the CD-ROM includes a collection of useful Internet links (see file
dsp4fpga.html under book2e/util), such as device vendors, software tools, VHDL
and Verilog resources, and links to HDL introductions available online, e.g., the
“Verilog Handbook” by Dr. D. Hyde and “The VHDL Cookbook” by Dr.
P. Ashenden. Altera’s VHDL, Verilog, and AHDL manuals are available through
Altera’s literature service (set of all 3 at $89), and may later become available
through Altera’s Web page.
- RLS 410, 418

- Widrow-Hoff LMS 376 

- Winograd DFT 258 

- Winograd FFT 274 
Altera 7, 17 

AMD 7 

Bartlett window 119, 243 
Bijective 189
Bitreverse 283 
Blackman window 119, 243 
Blowfish 340 
Butterfly 263, 268 



CAST 340 

Chirp-z algorithm 248
CIC filter 187-203 

- RNS design 190 
Coding bounds 311 
Codes 

- block 

- - decoders 314 

- - encoder 313 

- convolutional 

- - comparison 324 

- - complexity 323 

- - decoder 318, 322 

- - encoder 318, 322 

- tree codes 317 
Contour plot 378 
Convergence 378, 378, 379 

- time constant 379 
Convolution 

- Bluestein 283 

- cyclic 283 

- linear 88, 109 
Cooley-Tukey

- FFT 265 

- NTT 297 

CORDIC algorithm 94-103 
Costas loop 

- architecture 358 

- demodulation 358 

- implementation 360 
CPLD 6, 4 

Cryptography 324-340 
Cypress 7 

Daubechies 216, 220, 231, 238, 239 
Data encryption standard (DES) 334-340
DCT 

- definition 279 

- fast implementation 282 







- 2D 280 

- JPEG 280 
Decimation 175 
Decimator 

- CIC 190 

- IIR 167
Demodulator 346 

- Costas loop 358 

- I/Q generation 347 

- zero IF 348 

- PLL 353 
DFT 

- computation using 

- - NTT 305 

- - Walsh-Hadamard transformation 305

- - AFT 305 

- definition 242

- inverse 242

- filter bank 211 

- Rader 261 

- real 245 

- Winograd 258 

Digital signal processing (DSP) 2, 90 
Discrete 

- Cosine transform, see DCT 282 

- Fourier transform, see DFT 242

- Hartley transform 285 

- Sine transform (DST) 279 

- Wavelet transform (DWT) 233-238 
Distributed arithmetic 88-94 

- Optimization 

- - Size 93 

- - Speed 94 

- signed 137 
Divider 63-76 

- array 

- - performance 77 

- - size 78 

- convergence 74 

- fast 72 

- LPM 77, 491, 506 

- nonperforming 70, 106 

- nonrestoring 71, 106 

- restoring 67 

- types 66 
Dyadic DWT 234 

Eigenfrequency 189 

Eigenvalues ratio 382, 383, 389, 390, 412 

Encoder 313, 318, 322 

Error 

- control 306-324 

- cost functions 370 



- residue 373 

Fast RLS algorithm 10 
Fermat NTT 295 
Filter 109-171 

- cascaded integrator comb (CIC) 187-203

- causal 114 

- CSD code 123

- conjugate mirror 224

- distributed arithmetic (DA) 128 

- finite impulse response (FIR) 109-143 

- frequency sampling 207 

- infinite impulse response (IIR) 148-171

- lattice 225 

- polyphase implementation 180 

- signed DA 137 

- symmetric 116 

- transposed 111 

- recursive 210 
Filter bank 

- constant 

- - bandwidth 230 

- - Q 230 

- DFT 211 

- two-channel 215-229

- - aliasing free 218 

- - Haar 218 

- - lattice 225 

- - linear-phase 228

- - lifting 223 

- - QMF 215 

- - orthogonal 225 

- - perfect reconstruction 218 

- - polyphase 224 

- - mirror frequency 215 

- - comparison 229 
Filter design 

- Butterworth 154 

- Chebyshev 155 

- Comparison of FIR to IIR 148

- elliptic 154 

- equiripple 121 

- frequency sampling 207 

- Kaiser window 118 

- Parks-McClellan 121 

Finite impulse response (FIR), see Filter 109-143

Flip-flop 

- LPM 14, 23, 55, 54, 54, 490 
Floating-point 

- addition 83 

- arithmetic 76 

- conversion to fixed-point 79 
- division 84

- multiplication 81 

- numbers 47 

- reciprocal 86

- synthesis results 87 
FFT 

- comparison 277 

- Good-Thomas 260 

- group 263 

- Cooley-Tukey 265 

- in-place 278 

- index map 259

- Radix-r 263

- stage 263 

- Winograd 274 
FPGA 

- Altera Flex 17 

- architecture 6 

- benchmark 9 

- Compiler II 516

- design compilation 25 

- floor plan 25 

- graphical design entry 23 

- performance analysis 27 

- power dissipation 12 

- registered performance 27 

- routing 5, 19 

- simulation 26 

- size 17, 18 

- technology 7 

- timing 20 

- waveform files 28 

- Xilinx XC4000 17 

FPL, see FPGA and CPLD 
Fractal 237 

Frequency sampling filter 207 

Galois Field 311 
Gauss primes 45 
Generator 43, 44 
Gibbs phenomenon 118
Good-Thomas 

- FFT 260 

- NTT 297 

Goodman/Carey half-band filter 205, 219, 238
Gradient 375 

Hadamard 362 
Half-band filter

- decimator 206 

- factorization 219 

- Goodman and Carey 205, 219 

- definition 204 



Hamming window 119, 243 
Hann window 119, 243 
Hogenauer filter, see CIC 
Homomorphism 187 

IDEA 340 

Identification 369, 381, 391 
Isomorphism 187 
Image compression 280 
Index 43 

- multiplier 44 

- maps 

- - in FFTs 259 

- - in NTTs 297 

Infinite impulse response (IIR) filter 148-171

- finite wordlength effects 161 

- fast filtering using 

- - time-domain interleaving 163 

- - clustered look-ahead pipelining 165 

- - scattered look-ahead pipelining 166 

- - decimator design 168 

- - parallel processing 169 

- - RNS design 171 
In-place 278 

Interference cancellation 366, 409 
Inverse 

- multiplicative 260 

- additive 292 

- system modeling 368 

JPEG, see Image compression 
Kaiser 

- window 243 

- window filter design 119 
Kalman gain 408, 410, 413 
Kronecker product 274 

Learning curves 381 

- RLS 407, 410 
LPM 

- add_sub 14, 23, 55, 265, 491, 494 

- divider 77, 491, 506 

- flip-flop 14, 23, 55, 54, 54, 490 

- multiplier 111, 392, 399, 265, 491, 499 

- ROM 23, 135, 491, 503 
Lifting 223 

Linear feedback shift register 326 
LMS algorithm 376, 418 

- normalized 384, 385 

- design 393, 

- pipelined 396 
- - delayed 396

- - design 399, 402 

- - look-ahead 398 

- - transposed 399 

- - block FFT 388 

- simplified 403, 405 

- - error floor 404 
MAC 54 

Mersenne NTT 296 
Mobius function 305 
Multiplier 

- adder graph 127, 161 

- array 61 

- block 65 

- Booth 104 

- complex 105, 265 

- FPGA array 62 

- floating-point 81, 87 

- index 44 

- LPM 111, 392, 399, 265, 491, 499 

- performance 63 

- QRNS 45 

- quarter-square 171 

- serial/parallel 60 

- size 64 
Modulation 341 

- using CORDIC 345 
Modulo 

- adder 44 

- multiplier 44, 

- reconstruction 202 

NAND 5, 27 
Number representation 

- canonical signed digit (CSD) 36, 161 

- diminished by one (D1) 35, 293

- floating-point 50 

- one’s complement (1C) 35, 293 

- two’s complement (2C) 35 

- sign magnitude (SM) 35 
Number theoretic transform 289-305 

- Agarwal-Burrus 298 

- convolution 293 

- definition 289 

- Fermat 295 

- Mersenne 296 

- wordlength 296 

Order 

- filter 110 

- for NTTs 296 
Ordering, see index map 
Orthogonal 



- wavelet transform 220 

- filter bank 225 

Perfect reconstruction 218 
Phase-locked loop (PLL) 

- with accumulator reference 349 

- demodulator 354 

- digital 356 

- implementation 355, 357 

- linear 353 
Plessey ERA 5 
Pole/zero diagram 167, 225 
Polyphase representation 180, 221 
Power 

- dissipation 12 

- estimation 383, 386 

- line hum 373, 375, 378, 380, 392, 403 
Prediction 367 

- forward 413 

- backward 414 
Prime number 

- Fermat 291 

- Mersenne 291 
Primitive element 43 
Programmable signal processor 2, 11, 87

Public key systems 340 

Quadratic RNS (QRNS) 45 
Quadrature Mirror Filter (QMF) 215 

Rader 

- DFT 261 

- NTT 301 
RC5 340 

Rectangular window 119, 243 
Reduced adder graph 127, 161 
RLS algorithm 406, 410, 416 
RNS 

- CIC filter 190 

- complex 46 

- IIR filter 171

- Quadratic 45 

- scaling 202 
ROM 

- LPM 23, 135, 491, 503 
RSA 340 

Sampling 

- Frequency 243 

- Time 243 

Sea of gates Plessey ERA 5 
Self-similar 234 
Simulator

- ModelTechnology 518 
Step size 380, 379, 381, 389 
Subband filter 211 
Symmetry 

- in filter 116 

- in cryptographic algorithms 340 
Synthesizer 

- accumulator 22 

- PLL with accumulator 349 

Theorem 

- Chinese remainder 43 
Two-channel filter bank 215-229 

- comparison 229 

- lifting 223 

- orthogonal 224 

- QMF 224 

- polyphase 221 
Transformation 

- arithmetic Fourier 305 

- continuous Wavelet 234 

- discrete cosine 282 

- discrete Fourier 242 

- - inverse (IDFT) 242 

- discrete Hartley 285 

- discrete Wavelet 233-238 

- domain LMS 388 

- Fourier 243 

- Fermat NTT 295 

- pseudo-NTT 297 

- short-time Fourier (STFT) 230 

- discrete sine 279 

- Mersenne NTT 296 

- number theoretic 289-305 

- Walsh-Hadamard 305 
Triple DES 339 
Verilog 

- key words 487 
VHDL 

- styles 14 

- key words 487 

Walsh 361 
Wavelets 233-238 

- continuous 234 
- linear phase 228

- orthogonal 220 

Widrow-Hoff LMS algorithm 376 
Wiener-Hopf equation 372 
Windows 119, 243 
Winograd DFT algorithm 258 
Winograd FFT algorithm 274 



Wordlength 
- IIR filter 161
- NTT 296 

Zech logarithm 45 




